FYS3150 - H?st 2014

Slides from FYS3150/4150 Lectures

Morten Hjorth-Jensen [1, 2]

[1] Department of Physics and Center of Mathematics for Applications, University of Oslo [2] National Superconducting Cyclotron Laboratory, Michigan State University

Sep 22, 2014

Week 34

Overview of week 34

Monday: First lecture: Presentation of the course, aims and content
Monday: Second Lecture: Introduction to C++ programming and numerical precision.
Tuesday: Numerical precision and C++ programming, continued and exercises for first week
Numerical differentiation and loss of numerical precision (chapter 3 lecture notes)
Computer lab: Thursday and Friday. First time: Thursday and Friday this week, Presentation of hardware and software at room FV329 first hour of every labgroup and solution of first simple exercises.

Lectures and ComputerLab

Lectures: Monday (4.15pm-6pm) and Tuesday (12.15pm-2pm) only this and next week. Thereafter weeks 38 and 39, and weeks 43 and 44 and finally weeks 47 and 48.
Weekly reading assignments needed to solve projects.
First hour of each lab session used to discuss technicalities, address questions etc linked with projects.
Detailed lecture notes, exercises, all programs presented, projects etc can be found at the homepage of the course.
Computerlab: Thursday (9am-7pm) and Friday (9am-7pm) room FV329.
Weekly plans and all other information are on the official webpage.
Final written exam December 12, 9am (four hours).

Course Format

Several computer exercises, 5 compulsory projects. Electronic reports only.
Evaluation and grading: The last project (50% of final grade) and a final written exam (50% of final grade). Final written exam December 12.
The computer lab (room FV329)consists of 16 Linux PCs, but many prefer own laptops. C/C++ is the default programming language, but Fortran2008 and Python are also used. All source codes discussed during the lectures can be found at the webpage of the course. We recommend either C/C++, Fortran2008 or Python as languages.

ComputerLab

day	teacher
Thursday 9am-1pm	Anders, Morten L., Håvard, MHJ
Thursday 1pm-5pm	Anders, Morten L., Håvard, MHJ
Friday 9am-1pm	Anders, Morten L., Håvard, MHJ
Friday 1pm-5pm	Anders, Morten L., Håvard, MHJ

Topics covered in this course

Numerical precision and intro to C++ programming
Numerical derivation and integration
Random numbers and Monte Carlo integration
Monte Carlo methods in statistical physics
Quantum Monte Carlo methods
Linear algebra and eigenvalue problems
Non-linear equations and roots of polynomials
Ordinary differential equations
Partial differential equations
Parallelization of codes
Programming av GPUs (optional)

Syllabus FYS3150

Linear algebra and eigenvalue problems, chapters 6 and 7.

Know Gaussian elimination and LU decomposition
How to solve linear equations
How to obtain the inverse and the determinant of a real symmetric matrix
Cholesky and tridiagonal matrix decomposition

Syllabus FYS3150

Linear algebra and eigenvalue problems, chapters 6 and 7.

Householder's tridiagonalization technique and finding eigenvalues based on this
Jacobi's method for finding eigenvalues
Singular value decomposition
Qubic Spline interpolation

Syllabus FYS3150

Numerical integration, standard methods and Monte Carlo methods (chapters 4 and 11).

Trapezoidal, rectangle and Simpson's rules
Gaussian quadrature, emphasis on Legendre polynomials, but you need to know about other polynomials as well.
Brute force Monte Carlo integration
Random numbers (simplest algo, ran0) and probability distribution functions, expectation values
Improved Monte Carlo integration and importance sampling.

Syllabus FYS3150

Monte Carlo methods in physics (chapters 12, 13, and 14).

Random walks and Markov chains and relation with diffusion equation
Metropolis algorithm, detailed balance and ergodicity
Simple spin systems and phase transitions
Variational Monte Carlo
How to construct trial wave functions for quantum systems

Syllabus FYS3150

Ordinary differential equations (chapters 8 and 9).

Euler's method and improved Euler's method, truncation errors
Runge Kutta methods, 2nd and 4th order, truncation errors
How to implement a second-order differential equation, both linear and non-linear. How to make your equations dimensionless.
Boundary value problems, shooting and matching method (chap 9).

Syllabus FYS3150

Partial differential equations, chapter 10.

Set up diffusion, Poisson and wave equations up to 2 spatial dimensions and time
Set up the mathematical model and algorithms for these equations, with boundary and initial conditions. Their stability conditions.
Explicit, implicit and Crank-Nicolson schemes, and how to solve them. Remember that they result in triangular matrices.
How to compute the Laplacian in Poisson's equation.
How to solve the wave equation in one and two dimensions.

Overarching aims of this course

Develop a critical approach to all steps in a project, which methods are most relevant, which natural laws and physical processes are important. Sort out initial conditions and boundary conditions etc.
This means to teach you structured scientific computing, learn to structure a project.
A critical understanding of central mathematical algorithms and methods from numerical analysis. In particular their limits and stability criteria.
Always try to find good checks of your codes (like solutions on closed form)
To enable you to develop a critical view on the mathematical model and the physics.

And, there is nothing like a code which gives correct results!!

J. J. Barton and L. R. Nackman,*Scientific and Engineering C++*, Addison Wesley, 3rd edition 2000.
B. Stoustrup, The C++ programming language, Pearson, 1997.
H. P. Langtangen INF-VERK3830 http://heim.ifi.uio.no/~hpl/INF-VERK4830/
D. Yang, C++ and Object-oriented Numeric Computing for Scientists and Engineers, Springer 2000.
More books reviewed at http:://www.accu.org/ and http://www.comeaucomputing.com/booklist/

Other courses in Computational Science at UiO

Bachelor/Master/PhD Courses.

INF-MAT4350 Numerical linear algebra
MAT-INF3300/3310, PDEs and Sobolev spaces I and II
INF-MAT3360 PDEs
INF5620 Numerical methods for PDEs, finite element method
FYS4411 Computational physics II (Parallelization (MPI), object orientation, quantum mechanical systems with many interacting particles), spring semester
FYS4460 Computational physics III (Parallelization (MPI), object orientation, classical statistical physics, simulation of phase transitions, spring semester
INF3331 Problem solving with high-level languages (Python), fall semester
INF3380 Parallel computing for problems in the Natural Sciences (mostly PDEs), spring semester

Extremely useful tools, strongly recommended

and discussed at the lab sessions the first week.

GIT for version control (see webpage)
ipython notebook
QTcreator for editing and mastering computational projects (for C++ codes, see webpage of course)
Armadillo as a useful numerical library for C++, highly recommended
Unit tests, see also webpage
Devilry for handing in projects

A structured programming approach

Before writing a single line, have the algorithm clarified and understood. It is crucial to have a logical structure of e.g., the flow and organization of data before one starts writing.
Always try to choose the simplest algorithm. Computational speed can be improved upon later.
Try to write a as clear program as possible. Such programs are easier to debug, and although it may take more time, in the long run it may save you time. If you collaborate with other people, it reduces spending time on debuging and trying to understand what the codes do. A clear program will also allow you to remember better what the program really does!

A structured programming approach

The planning of the program should be from top down to bottom, trying to keep the flow as linear as possible. Avoid jumping back and forth in the program. First you need to arrange the major tasks to be achieved. Then try to break the major tasks into subtasks. These can be represented by functions or subprograms. They should accomplish limited tasks and as far as possible be independent of each other. That will allow you to use them in other programs as well.
Try always to find some cases where an analytical solution exists or where simple test cases can be applied. If possible, devise different algorithms for solving the same problem. If you get the same answers, you may have coded things correctly or made the same error twice or more.

Getting Started

Compiling and linking, without QTcreator.

In order to obtain an executable file for a C++ program, the following instructions under Linux/Unix can be used

c++ -c -Wall myprogram.cpp
c++ -o myprogram myprogram.o

where the compiler is called through the command c++/g++. The compiler option -Wall means that a warning is issued in case of non-standard language. The executable file is in this case myprogram. The option -c is for compilation only, where the program is translated into machine code, while the -o option links the produced object file myprogram.o and produces the executable myprogram .

For Fortran2008 we use the Intel compiler, replace c++ with ifort. Also, to speed up the code use compile options like

c++ -O3 -c -Wall myprogram.cpp

Makefiles and simple scripts

Under Linux/Unix it is often convenient to create a so-called makefile, which is a script which includes possible compiling commands.

# Comment lines
# General makefile for c - choose PROG =   name of given program
# Here we define compiler option, libraries and the  target
CC= g++ -Wall
PROG= myprogram
# this is the math library in C, not necessary for C++
LIB = -lm
# Here we make the executable file
${PROG} :          ${PROG}.o
                   ${CC} ${PROG}.o ${LIB} -o ${PROG}
# whereas here we create the object file
${PROG}.o :       ${PROG}.c
                  ${CC} -c ${PROG}.c

If you name your file for makefile, simply type the command make and Linux/Unix executes all of the statements in the above makefile. Note that C++ files have the extension .cpp.

Hello world

The C encounter.

Here we present first the C version.

/* comments in C begin like this and end with */
#include <stdlib.h> /* atof function */
#include <math.h>   /* sine function */
#include <stdio.h>  /* printf function */
int main (int argc, char* argv[])
{
  double r, s;        /* declare variables */
  r = atof(argv[1]);  /* convert the text argv[1] to double */
  s = sin(r);
  printf("Hello, World! sin(%g)=%g\n", r, s);
  return 0;           /* success execution of the program */

Hello World

Dissection I.

The compiler must see a declaration of a function before you can call it (the compiler checks the argument and return types). The declaration of library functions appears in so-called "header files" that must be included in the program, e.g.,

   #include <stdlib.h> /* atof function */

We call three functions (atof, sin, printf) and these are declared in three different header files. The main program is a function called main with a return value set to an integer, int (0 if success). The operating system stores the return value, and other programs/utilities can check whether the execution was successful or not. The command-line arguments are transferred to the main function through

   int main (int argc, char* argv[])

Hello World

Dissection II.

The command-line arguments are transferred to the main function through

   int main (int argc, char* argv[])

The integer argc is the no of command-line arguments, set to one in our case, while argv is a vector of strings containing the command-line arguments with argv[0] containing the name of the program and argv[1], argv[2], ... are the command-line args, i.e., the number of lines of input to the program. Here we define floating points, see also below, through the keywords float for single precision real numbers and double for double precision. The function atof transforms a text (argv[1]) to a float. The sine function is declared in math.h, a library which is not automatically included and needs to be linked when computing an executable file.

With the command printf we obtain a formatted printout. The printf syntax is used for formatting output in many C-inspired languages (Perl, Python, Awk, partly C++).

Hello World

Now in C++.

Here we present first the C++ version.

// A comment line begins like this in C++ programs
// Standard ANSI-C++ include files
using namespace std
#include <iostream>  // input and output
int main (int argc, char* argv[])
{
  // convert the text argv[1] to double using atof:
  double r = atof(argv[1]);
  double s = sin(r);
  cout << "Hello, World! sin(" << r << ")=" << s << '\n';
  // success
  return 0;
}

C++ Hello World

Dissection I.

We have replaced the call to printf with the standard C++ function cout. The header file <iostream.h> is then needed. In addition, we don't need to declare variables like r and s at the beginning of the program. I personally prefer however to declare all variables at the beginning of a function, as this gives me a feeling of greater readability.

Brief summary

C/C++ program.

A C/C++ program begins with include statements of header files (libraries,intrinsic functions etc)
Functions which are used are normally defined at top (details next week)
The main program is set up as an integer, it returns 0 (everything correct) or 1 (something went wrong)
Standard if, while and for statements as in Java, Fortran, Python...
Integers have a very limited range.

Brief summary

Arrays.

A C/C++ array begins by indexing at 0!
Array allocations are done by size, not by the final index value.If you allocate an array with 10 elements, you should index them from $ 0,1,\dots, 9 $.
Initialize always an array before a computation.

Serious problems and representation of numbers

Integer and Real Numbers.

Overflow
Underflow
Roundoff errors
Loss of precision

Limits, you must declare variables

C++ and Fortran declarations.

type in C/C++ and Fortran2008	bits	range
`int/INTEGER` (2)	16	-32768 to 32767
`unsigned int`	16	0 to 65535
`signed int`	16	-32768 to 32767
`short int`	16	-32768 to 32767
`unsigned short int`	16	0 to 65535
`signed short int`	16	$ -32768 $ to 32767
`int/long int/INTEGER` (4)	32	-2147483648 to 2147483647
`signed long int`	32	-2147483648 to 2147483647
`float/REAL(4)`	32	$ 3.4\times 10^{-44} $ to $ 3.4\times 10^{+38} $
`double/REAL(8)`	64	$ 1.7\times 10^{-322} $ to $ 1.7\times 10^{+308} $
`long double`	64	$ 1.7\times 10^{-322} $ to $ 1.7\times 10^{+308} $

From decimal to binary representation

How to do it.

$$ a_n2^n+a_{n-1}2^{n-1} +a_{n-2}2^{n-2} +\dots +a_{0}2^{0}. $$

In binary notation we have thus $ (417)_{10} =(110110001)_2 $ since we have $$ (110100001)_2 =1\times2^8+1\times 2^{7}+0\times 2^{6}+1\times 2^{5}+0\times 2^{4}+0\times 2^{3}+0\times 2^{2}+0\times 2^{2}+0\times 2^{1}+1\times 2^{0}. $$ To see this, we have performed the following divisions by 2

417/2=208	remainder 1	coefficient of $ 2^{0} $ is 1
208/2=104	remainder 0	coefficient of $ 2^{1} $ is 0
104/2=52	remainder 0	coefficient of $ 2^{2} $ is 0
52/2=26	remainder 0	coefficient of $ 2^{3} $ is 0
26/2=13	remainder 1	coefficient of $ 2^{4} $ is 0
13/2= 6	remainder 1	coefficient of $ 2^{5} $ is 1
6/2= 3	remainder 0	coefficient of $ 2^{6} $ is 0
3/2= 1	remainder 1	coefficient of $ 2^{7} $ is 1
1/2= 0	remainder 1	coefficient of $ 2^{8} $ is 1

From decimal to binary representation

Integer numbers.

using namespace std;
#include <iostream>
int main (int argc, char* argv[])
{
  int i;
  int terms[32]; // storage of a0, a1, etc, up to 32 bits
  int number = atoi(argv[1]);
  // initialise the term a0, a1 etc
  for (i=0; i < 32 ; i++){ terms[i] = 0;}
  for (i=0; i < 32 ; i++){
    terms[i] = number%2;
    number /= 2;

  // write out results
  cout << "Number of bytes used= " << sizeof(number) << endl;
  for (i=0; i < 32 ; i++){
    cout << " Term nr: " << i << "Value= " << terms[i];
    cout << endl;

  return 0;

From decimal to binary representation

Integer numbers, Fortran.

PROGRAM binary_integer
IMPLICIT NONE
  INTEGER  i, number, terms(0:31) ! storage of a0, a1, etc, up to 32 bits

  WRITE(*,*) 'Give a number to transform to binary notation'
  READ(*,*) number
! Initialise the terms a0, a1 etc
  terms = 0
! Fortran takes only integer loop variables
  DO i=0, 31
     terms(i) = MOD(number,2)
     number = number/2
  ENDDO
! write out results
  WRITE(*,*) 'Binary representation '
  DO i=0, 31
    WRITE(*,*)' Term nr and value', i, terms(i)
  ENDDO

END PROGRAM binary_integer

Integer Numbers

Possible Overflow for Integers.

// A comment line begins like this in C++ programs
// Program to calculate 2**n
// Standard ANSI-C++ include files */
using namespace std
#include <iostream>
#include <cmath>
int main()
{
   int  int1, int2, int3;
// print to screen
   cout << "Read in the exponential N for 2^N =\n";
// read from screen
   cin >> int2;
   int1 = (int) pow(2., (double) int2);
   cout << " 2^N * 2^N = " << int1*int1 << "\n";
   int3 = int1 - 1;
   cout << " 2^N*(2^N - 1) = " << int1 * int3  << "\n";
   cout << " 2^N- 1 = " << int3  << "\n";
   return 0;

// End: program main()

Loss of Precision

Machine Numbers.

In the decimal system we would write a number like $ 9.90625 $ in what is called the normalized scientific notation. $$ 9.90625=0.990625\times 10^{1}, $$ and a real non-zero number could be generalized as $$ \begin{equation} x=\pm r\times 10^{{\mbox{n}}}, \end{equation} $$ with $ r $ a number in the range $ 1/10 \le r < 1 $. In a similar way we can use represent a binary number in scientific notation as $$ \begin{equation} x=\pm q\times 2^{{\mbox{m}}}, \end{equation} $$ with $ q $ a number in the range $ 1/2 \le q < 1 $. This means that the mantissa of a binary number would be represented by the general formula $$ \begin{equation} (0.a_{-1}a_{-2}\dots a_{-n})_2=a_{-1}\times 2^{-1} +a_{-2}\times 2^{-2}+\dots+a_{-n}\times 2^{-n}. \end{equation} $$

Loss of Precision

Machine Numbers.

In a typical computer, floating-point numbers are represented in the way described above, but with certain restrictions on $ q $ and $ m $ imposed by the available word length. In the machine, our number $ x $ is represented as $$ \begin{equation} x=(-1)^s\times {\mbox{mantissa}}\times 2^{{\mbox{exponent}}}, \end{equation} $$

where $ s $ is the sign bit, and the exponent gives the available range. With a single-precision word, 32 bits, 8 bits would typically be reserved for the exponent, 1 bit for the sign and 23 for the mantissa.

Loss of Precision

Machine Numbers.

A modification of the scientific notation for binary numbers is to require that the leading binary digit 1 appears to the left of the binary point. In this case the representation of the mantissa $ q $ would be $ (1.f)_2 $ and $ 1 \le q < 2$. This form is rather useful when storing binary numbers in a computer word, since we can always assume that the leading bit 1 is there. One bit of space can then be saved meaning that a 23 bits mantissa has actually 24 bits. This means explicitely that a binary number with 23 bits for the mantissa reads $$ \begin{equation} (1.a_{-1}a_{-2}\dots a_{-23})_2=1\times 2^0+a_{-1}\times 2^{-1}+ +a_{-2}\times 2^{-2}+\dots+a_{-23}\times 2^{-23}. \end{equation} $$ As an example, consider the 32 bits binary number $$ (10111110111101000000000000000000)_2, $$ where the first bit is reserved for the sign, 1 in this case yielding a negative sign. The exponent $ m $ is given by the next 8 binary numbers $ 01111101 $ resulting in 125 in the decimal system.

Loss of Precision

Machine Numbers.

However, since the exponent has eight bits, this means it has $ 2^8-1=255 $ possible numbers in the interval $ -128 \le m \le 127 $, our final exponent is $ 125-127=-2 $ resulting in $ 2^{-2} $. Inserting the sign and the mantissa yields the final number in the decimal representation as $$ -2^{-2}\left(1\times 2^0+1\times 2^{-1}+ 1\times 2^{-2}+1\times 2^{-3}+0\times 2^{-4}+1\times 2^{-5}\right)=$$ $$ (-0.4765625)_{10}. $$ In this case we have an exact machine representation with 32 bits (actually, we need less than 23 bits for the mantissa).

If our number $ x $ can be exactly represented in the machine, we call $ x $ a machine number. Unfortunately, most numbers cannot and are thereby only approximated in the machine. When such a number occurs as the result of reading some input data or of a computation, an inevitable error will arise in representing it as accurately as possible by a machine number.

Loss of Precision

Machine Numbers.

A floating number x, labelled $ fl(x) $ will therefore always be represented as $$ \begin{equation} fl(x) = x(1\pm \epsilon_x), \end{equation} $$ with $ x $ the exact number and the error $ |\epsilon_x| \le |\epsilon_M| $, where $ \epsilon_M $ is the precision assigned. A number like $ 1/10 $ has no exact binary representation with single or double precision. Since the mantissa $$ \left(1.a_{-1}a_{-2}\dots a_{-n}\right)_2 $$ is always truncated at some stage $ n $ due to its limited number of bits, there is only a limited number of real binary numbers. The spacing between every real binary number is given by the chosen machine precision. For a 32 bit words this number is approximately $ \epsilon_M \sim 10^{-7}$ and for double precision (64 bits) we have $ \epsilon_M \sim 10^{-16}$, or in terms of a binary base as $ 2^{-23} $ and $ 2^{-52} $ for single and double precision, respectively.

Loss of Precision

Machine Numbers.

In the machine a number is represented as $$ \begin{equation} fl(x)= x(1+\epsilon) \end{equation} $$

where $ |\epsilon| \leq \epsilon_M $ and $ \epsilon $ is given by the specified precision, $ 10^{-7} $ for single and $ 10^{-16} $ for double precision, respectively. $ \epsilon_M $ is the given precision. In case of a subtraction $ a=b-c $, we have $$ \begin{equation} fl(a)=fl(b)-fl(c) = a(1+\epsilon_a), \end{equation} $$ or $$ \begin{equation} fl(a)=b(1+\epsilon_b)-c(1+\epsilon_c), \end{equation} $$

meaning that $$ \begin{equation} fl(a)/a=1+\epsilon_b\frac{b}{a}- \epsilon_c\frac{c}{a}, \end{equation} $$

and if $ b\approx c $ we see that there is a potential for an increased error in $ fl(a) $.

Loss of Precision

Machine Numbers.

Define the absolute error as $$ \begin{equation} |fl(a)-a|, \end{equation} $$ whereas the relative error is $$ \begin{equation} \frac{ |fl(a)-a|}{a} \le \epsilon_a. \end{equation} $$ The above subraction is thus $$ \begin{equation} \frac{ |fl(a)-a|}{a}=\frac{ |fl(b)-fl(c)-(b-c)|}{a}, \end{equation} $$ yielding $$ \begin{equation} \frac{ |fl(a)-a|}{a}=\frac{ |b\epsilon_b- c\epsilon_c|}{a}. \end{equation} $$ The relative error is the quantity of interest in scientific work. Information about the absolute error is normally of little use in the absence of the magnitude of the quantity being measured.

Loss of numerical precision

Suppose we wish to evaluate the function $$ f(x)=\frac{1-\cos(x)}{\sin(x)}, $$ for small values of $ x $. Five leading digits. If we multiply the denominator and numerator with $ 1+\cos(x) $ we obtain the equivalent expression $$ f(x)=\frac{\sin(x)}{1+\cos(x)}. $$

If we now choose $ x=0.007 $ (in radians) our choice of precision results in $$ \sin(0.007)\approx 0.69999\times 10^{-2}, $$ and $$ \cos(0.007)\approx 0.99998. $$

Loss of numerical precision

The first expression for $ f(x) $ results in $$ f(x)=\frac{1-0.99998}{0.69999\times 10^{-2}}=\frac{0.2\times 10^{-4}}{0.69999\times 10^{-2}}=0.28572\times 10^{-2}, $$ while the second expression results in $$ f(x)=\frac{0.69999\times 10^{-2}}{1+0.99998}= \frac{0.69999\times 10^{-2}}{1.99998}=0.35000\times 10^{-2}, $$ which is also the exact result. In the first expression, due to our choice of precision, we have only one relevant digit in the numerator, after the subtraction. This leads to a loss of precision and a wrong result due to a cancellation of two nearly equal numbers. If we had chosen a precision of six leading digits, both expressions yield the same answer.

Loss of numerical precision

If we were to evaluate $ x\sim \pi $, then the second expression for $ f(x) $ can lead to potential losses of precision due to cancellations of nearly equal numbers.

This simple example demonstrates the loss of numerical precision due to roundoff errors, where the number of leading digits is lost in a subtraction of two near equal numbers. The lesson to be drawn is that we cannot blindly compute a function. We will always need to carefully analyze our algorithm in the search for potential pitfalls. There is no magic recipe however, the only guideline is an understanding of the fact that a machine cannot represent correctly all numbers.

Loss of precision can cuae serious problems

Real Numbers.

Overflow: When the positive exponent exceeds the max value, e.g., 308 for DOUBLE PRECISION (64 bits). Under such circumstances the program will terminate and some compilers may give you the warning OVERFLOW.
Underflow: When the negative exponent becomes smaller than the min value, e.g., -308 for DOUBLE PRECISION. Normally, the variable is then set to zero and the program continues. Other compilers (or compiler options) may warn you with the UNDERFLOW message and the program terminates.

Loss of precision, real numbers

Roundoff errors

A floating point number like $$ \begin{equation} x= 1.234567891112131468 = 0.1234567891112131468\times 10^{1} \end{equation} $$ may be stored in the following way. The exponent is small and is stored in full precision. However, the mantissa is not stored fully. In double precision (64 bits), digits beyond the 15th are lost since the mantissa is normally stored in two words, one which is the most significant one representing 123456 and the least significant one containing 789111213. The digits beyond 3 are lost. Clearly, if we are summing alternating series with large numbers, subtractions between two large numbers may lead to roundoff errors, since not all relevant digits are kept. This leads eventually to the next problem, namely

More on loss of precision

Real Numbers.

Loss of precision: When one has to e.g., multiply two large numbers where one suspects that the outcome may be beyond the bonds imposed by the variable declaration, one could represent the numbers by logarithms, or rewrite the equations to be solved in terms of dimensionless variables. When dealing with problems in e.g., particle physics or nuclear physics where distance is measured in fm ($ 10^{-15} $ m), it can be quite convenient to redefine the variables for distance in terms of a dimensionless variable of the order of unity. To give an example, suppose you work with single precision and wish to perform the addition $ 1+10^{-8} $. In this case, the information containing in $ 10^{-8} $ is simply lost in the addition. Typically, when performing the addition, the computer equates first the exponents of the two numbers to be added. For $ 10^{-8} $ this has however catastrophic consequences since in order to obtain an exponent equal to $ 10^0 $, bits in the mantissa are shifted to the right. At the end, all bits in the mantissa are zeros.

A problematic case

Three ways of computing $ e^{-x} $.

Brute force: $$\exp{(-x)}=\sum_{n=0}^{\infty}(-1)^n\frac{x^n}{n!}$$

Recursion relation for $$ \exp{(-x)}=\sum_{n=0}^{\infty}s_n=\sum_{n=0}^{\infty}(-1)^n\frac{x^n}{n!} $$ $$ s_n=-s_{n-1}\frac{x}{n}, $$ $$ \exp{(x)}=\sum_{n=0}^{\infty}s_n $$ $$ \exp{(-x)}=\frac{1}{\exp{(x)}} $$

Program to compute $ \exp{(-x)} $

Brute Force.

// Program to calculate function exp(-x)
// using straightforward summation with differing  precision
using namespace std
#include <iostream>
#include <cmath>
// type float:  32 bits precision
// type double: 64 bits precision
#define   TYPE          double
#define   PHASE(a)      (1 - 2 * (abs(a) % 2))
#define   TRUNCATION    1.0E-10
// function declaration
TYPE factorial(int);

Program to compute $ \exp{(-x)} $

Still Brute Force.

int main()
{
   int   n;
   TYPE  x, term, sum;
   for(x = 0.0; x < 100.0; x += 10.0)  {
     sum  = 0.0;                //initialization
     n    = 0;
     term = 1;
     while(fabs(term) > TRUNCATION)  {
         term =  PHASE(n) * (TYPE) pow((TYPE) x,(TYPE) n)
                / factorial(n);
         sum += term;
         n++;
     }  // end of while() loop

Program to compute $ \exp{(-x)} $

Oh, it never ends!

      printf("\nx = %4.1f   exp = %12.5E  series = %12.5E
              number of terms = %d",
              x, exp(-x), sum, n);
   } // end of for() loop

   printf("\n");           // a final line shift on output
   return 0;
} // End: function main()
//     The function factorial()
//     calculates and returns n!
TYPE factorial(int n)
{
   int  loop;
   TYPE fac;
   for(loop = 1, fac = 1.0; loop <= n; loop++)  {
      fac *= loop;

   return fac;
} // End: function factorial()

Results $ \exp{(-x)} $

What is going on?

$ x $	$ \exp{(-x)} $	Series	Number of terms in series
0.0	0.100000E+01	0.100000E+01	1
10.0	0.453999E-04	0.453999E-04	44
20.0	0.206115E-08	0.487460E-08	72
30.0	0.935762E-13	-0.342134E-04	100
40.0	0.424835E-17	-0.221033E+01	127
50.0	0.192875E-21	-0.833851E+05	155
60.0	0.875651E-26	-0.850381E+09	171
70.0	0.397545E-30	NaN	171
80.0	0.180485E-34	NaN	171
90.0	0.819401E-39	NaN	171
100.0	0.372008E-43	NaN	171

Program to compute $ \exp{(-x)} $

// program to compute exp(-x) without exponentials
using namespace std
#include <iostream>
#include <cmath>
#define  TRUNCATION     1.0E-10

int main()
{
   int       loop, n;
   double    x, term, sum;
   for(loop = 0; loop <= 100; loop += 10)
   {
     x    = (double) loop;          // initialization
     sum  = 1.0;
     term = 1;
     n    = 1;

Program to compute $ \exp{(-x)} $

Last statements.

     while(fabs(term) > TRUNCATION)
       {
	 term *= -x/((double) n);
	 sum  += term;
	 n++;
       } // end while loop
     cout << "x = " << x   << " exp = " << exp(-x) <<"series = "
          << sum  << " number of terms =" << n << "\n";
   } // end of for() loop

   cout << "\n";           // a final line shift on output

}  /*    End: function main() */

Results $ \exp{(-x)} $

More Problems.

$ x $	$ \exp{(-x)} $	Series	Number of terms in series
0.000000	0.10000000E+01	0.10000000E+01	1
10.000000	0.45399900E-04	0.45399900E-04	44
20.000000	0.20611536E-08	0.56385075E-08	72
30.000000	0.93576230E-13	-0.30668111E-04	100
40.000000	0.42483543E-17	-0.31657319E+01	127
50.000000	0.19287498E-21	0.11072933E+05	155
60.000000	0.87565108E-26	-0.33516811E+09	182
70.000000	0.39754497E-30	-0.32979605E+14	209
80.000000	0.18048514E-34	0.91805682E+17	237
90.000000	0.81940126E-39	-0.50516254E+22	264
100.000000	0.37200760E-43	-0.29137556E+26	291

Most used formula for derivatives

3 point formulae.

First derivative ($ f_0 = f(x_0) $, $ f_{-h}=f(x_0-h) $ and $ f_{+h}=f(x_0+h) $ $$ \frac{f_h-f_{-h}}{2h}=f'_0+\sum_{j=1}^{\infty}\frac{f_0^{(2j+1)}}{(2j+1)!}h^{2j}. $$ Second derivative $$ \frac{ f_h -2f_0 +f_{-h}}{h^2}=f_0''+2\sum_{j=1}^{\infty}\frac{f_0^{(2j+2)}}{(2j+2)!}h^{2j}. $$

Error Analysis

$$ \epsilon=log_{10}\left(\left|\frac{f''_{\mbox{computed}}-f''_{\mbox{exact}}} {f''_{\mbox{exact}}}\right|\right), $$ $$ \epsilon_{\mbox{tot}}=\epsilon_{\mbox{approx}}+\epsilon_{\mbox{ro}}. $$

For the computed second derivative we have $$ f_0''=\frac{ f_h -2f_0 +f_{-h}}{h^2}-2\sum_{j=1}^{\infty}\frac{f_0^{(2j+2)}}{(2j+2)!}h^{2j}, $$ and the truncation or approximation error goes like $$ \epsilon_{\mbox{approx}}\approx \frac{f_0^{(4)}}{12}h^{2}. $$

Error Analysis

If we were not to worry about loss of precision, we could in principle make $ h $ as small as possible. However, due to the computed expression in the above program example $$ f_0''=\frac{ f_h -2f_0 +f_{-h}}{h^2}=\frac{ (f_h -f_0) +(f_{-h}-f_0)}{h^2}, $$ we reach fairly quickly a limit for where loss of precision due to the subtraction of two nearly equal numbers becomes crucial.

If $ (f_{\pm h} -f_0) $ are very close, we have $ (f_{\pm h} -f_0)\approx \epsilon_M $, where $ |\epsilon_M|\le 10^{-7} $ for single and $ |\epsilon_M|\le 10^{-15} $ for double precision, respectively.

We have then $$ \left|f_0''\right|= \left|\frac{ (f_h -f_0) +(f_{-h}-f_0)}{h^2}\right|\le \frac{ 2 \epsilon_M}{h^2}. $$

Error Analysis

Our total error becomes $$ \left|\epsilon_{\mbox{tot}}\right|\le \frac{2 \epsilon_M}{h^2} + \frac{f_0^{(4)}}{12}h^{2}. \label{eq:experror} $$ It is then natural to ask which value of $ h $ yields the smallest total error. Taking the derivative of $ \left|\epsilon_{\mbox{tot}}\right| $ with respect to $ h $ results in $$ h= \left(\frac{ 24\epsilon_M}{f_0^{(4)}}\right)^{1/4}. $$ With double precision and $ x=10 $ we obtain $$ h\approx 10^{-4}. $$ Beyond this value, it is essentially the loss of numerical precision which takes over.

Error Analysis

Due to the subtractive cancellation in the expression for $ f'' $ there is a pronounced detoriation in accuracy as $ h $ is made smaller and smaller.

It is instructive in this analysis to rewrite the numerator of the computed derivative as $$ (f_h -f_0) +(f_{-h}-f_0)=(e^{x+h}-e^{x}) + (e^{x-h}-e^{x}), $$ as $$ (f_h -f_0) +(f_{-h}-f_0)=e^x(e^{h}+e^{-h}-2), $$ since it is the difference $ (e^{h}+e^{-h}-2) $ which causes the loss of precision.

Error Analysis

$ x $	$ h=0.01 $	$ h=0.001 $	$ h=0.0001 $	$ h=0.0000001 $	Exact
0.0	1.000008	1.000000	1.000000	1.010303	1.000000
1.0	2.718304	2.718282	2.718282	2.753353	2.718282
2.0	7.389118	7.389057	7.389056	7.283063	7.389056
3.0	20.085704	20.085539	20.085537	20.250467	20.085537
4.0	54.598605	54.598155	54.598151	54.711789	54.598150
5.0	148.414396	148.413172	148.413161	150.635056	148.413159

Error Analysis

The results for $ x=10 $ are shown in the Table

$ h $	$ e^{h}+e^{-h} $	$ e^{h}+e^{-h}-2 $
$ 10^{-1} $	2.0100083361116070	$ 1.0008336111607230\times 10^{-2} $
$ 10^{-2} $	2.0001000008333358	$ 1.0000083333605581\times 10^{-4} $
$ 10^{-3} $	2.0000010000000836	$ 1.0000000834065048\times 10^{-6} $
$ 10^{-5} $	2.0000000099999999	$ 1.0000000050247593\times 10^{-8} $
$ 10^{-5} $	2.0000000001000000	$ 9.9999897251734637\times 10^{-11} $
$ 10^{-6} $	2.0000000000010001	$ 9.9997787827987850\times 10^{-13} $
$ 10^{-7} $	2.0000000000000098	$ 9.9920072216264089\times 10^{-15} $
$ 10^{-8} $	2.0000000000000000	$ 0.0000000000000000\times 10^{0} $
$ 10^{-9} $	2.0000000000000000	$ 1.1102230246251565\times 10^{-16} $
$ 10^{-10} $	2.0000000000000000	$ 0.0000000000000000\times 10^{0} $

Week 35

Overview of week 35

Monday: Repetition from last week
Numerical differentiation
C/C++ programming details, pointers, read/write to/from file
Tuesday: Intro to linear Algebra and presentation of project 1.
Matrices in C++ and Fortran2008
Gaussian elimination and discussion of project 1.
Computer-Lab: start project 1. Reading asssignments and preparation for project 1: sections 2.5 and 3.1 for general C++ and Fortran features. Sections 6.1-6.4 (till page 182) are relevant for project 1.

Technical Matter in C/C++: Pointers

A pointer specifies where a value resides in the computer's memory (like a house number specifies where a particular family resides on a street).

A pointer points to an address not to a data container of any kind!

Simple example declarations:

  using namespace std; // note use of namespace
  int main()
 {
   // what are the differences?
   int var;
   cin >> var;
   int *p, q;
   int *s, *t;
   int * a new[var];    // dynamic memory allocation
   delete [] a;
}

Technical Matter in C/C++: Pointer example I

using namespace std; // note use of namespace
int main()
{
  int var;
  int *p;
  p = &var;
  var  = 421;
  printf("Address of integer variable var : %p\n",&var);
  printf("Its value: %d\n", var);
  printf("Value of integer pointer p : %p\n",p);
  printf("The value p points at :  %d\n",*p);
  printf("Address of the pointer p : %p\n",&p);
  return 0;
}

Dissection: Pointer example I

Discussion.

int main()
{
  int var;     // Define an integer variable var
  int *p;      // Define a pointer to an integer
  p = &var;    // Extract the address of var
  var = 421;   // Change content of var
  printf("Address of integer variable var : %p\n", &var);
  printf("Its value: %d\n", var);  // 421
  printf("Value of integer pointer p : %p\n", p);  // = &var
  // The content of the variable pointed to by p is *p
  printf("The value p points at :  %d\n", *p);
  // Address where the pointer is stored in memory
  printf("Address of the pointer p : %p\n", &p);
  return 0;
}

Pointer example II

int matr[2];
int *p;
p = &matr[0];
matr[0] = 321;
matr[1] = 322;
printf("\nAddress of matrix element matr[1]: %p",&matr[0]);
printf("\nValue of the  matrix element  matr[1]; %d",matr[0]);
printf("\nAddress of matrix element matr[2]: %p",&matr[1]);
printf("\nValue of the matrix element  matr[2]: %d\n", matr[1]);
printf("\nValue of the pointer p: %p",p);
printf("\nThe value p points to: %d",*p);
printf("\nThe value that (p+1) points to  %d\n",*(p+1));
printf("\nAddress of pointer p : %p\n",&p);

Dissection: Pointer example II

int matr[2];    // Define integer array with two elements
int *p;         // Define pointer to integer
p = &matr[0];   // Point to the address of the first element in matr
matr[0] = 321;  // Change the first element
matr[1] = 322;  // Change the second element
printf("\nAddress of matrix element matr[1]: %p", &matr[0]);
printf("\nValue of the  matrix element  matr[1]; %d", matr[0]);
printf("\nAddress of matrix element matr[2]: %p", &matr[1]);
printf("\nValue of the matrix element  matr[2]: %d\n", matr[1]);
printf("\nValue of the pointer p: %p", p);
printf("\nThe value p points to: %d", *p);
printf("\nThe value that (p+1) points to  %d\n", *(p+1));
printf("\nAddress of pointer p : %p\n", &p);

Output of Pointer example II

Address of the matrix element matr[1]: 0xbfffef70
Value of the  matrix element  matr[1]; 321
Address of the matrix element matr[2]: 0xbfffef74
Value of the matrix element  matr[2]: 322
Value of the pointer: 0xbfffef70
The value pointer points at: 321
The value that (pointer+1) points at:  322
Address of the pointer variable : 0xbfffef6c

File handling; C-way

using namespace std;
#include <iostream>
int main(int argc, char *argv[])
{
  FILE *in_file, *out_file;
  if( argc < 3)  {
    printf("The programs has the following structure :\n");
    printf("write in the name of the input and output files \n");
    exit(0);
  }
  in_file = fopen( argv[1], "r");// returns pointer to the  input file
  if( in_file == NULL )  { // NULL means that the file is missing
    printf("Can't find the input file %s\n", argv[1]);
    exit(0);

File handling; C way cont.

 out_file = fopen( argv[2], "w"); // returns a pointer to the output file
 if( out_file == NULL )  {       // can't find the file
    printf("Can't find the output file%s\n", argv[2]);
    exit(0);
  }
  fclose(in_file);
  fclose(out_file);
  return 0;

File handling, C++-way

#include <fstream>

// input and output file as global variable
ofstream ofile;
ifstream ifile;

File handling, C++-way

int main(int argc, char* argv[])
{
  char *outfilename;
  //Read in output file, abort if there are too
  //few command-line arguments
  if( argc <= 1 ){
    cout << "Bad Usage: " << argv[0] <<
      " read also output file on same line" << endl;
    exit(1);
  }
  else{
    outfilename=argv[1];
  }
  ofile.open(outfilename);
  .....
  ofile.close();  // close output file

File handling, C++-way

void output(double r_min , double r_max, int max_step,
            double *d)
{
int i;
ofile << "RESULTS:" << endl;
ofile << setiosflags(ios::showpoint | ios::uppercase);
ofile <<"R_min = " << setw(15) << setprecision(8) <<r_min <<endl;
ofile <<"R_max = " << setw(15) << setprecision(8) <<r_max <<endl;
ofile <<"Number of steps = " << setw(15) << max_step << endl;
ofile << "Five lowest eigenvalues:" << endl;
for(i = 0; i < 5; i++) {
    ofile << setw(15) << setprecision(8) << d[i] << endl;

}  // end of function output

File handling, C++-way

int main(int argc, char* argv[])
{
  char *infilename;
  // Read in input file, abort if there are too
  // few command-line arguments
  if( argc <= 1 ){
    cout << "Bad Usage: " << argv[0] <<
      " read also input file on same line" << endl;
    exit(1);
  }
  else{
    infilename=argv[1];
  }
  ifile.open(infilename);
  ....
  ifile.close();  // close input file

File handling, C++-way

const char* filename1 = "myfile";
ifstream ifile(filename1);
string filename2 = filename1 + ".out"
ofstream ofile(filename2);  // new output file
ofstream ofile(filename2, ios_base::app);  // append

//      Read something from the file:

double a; int b; char c[200];
ifile >> a >> b >> c;  // skips white space in between

//      Can test on success of reading:

if (!(ifile >> a >> b >> c)) ok = 0;

Call by value and reference

int main(int argc, char argv[]) {
int  a:
int *b;
a = 10;
b = new int[10];
for (i = 0; i < 10; i++) {
  b[i] = i;
}
func(a, b);
delete [] b;
return 0;
}

Call by value and reference

Morten: Too complicated LaTeX code for computer code to be decoded....

Call by value and reference

Lines 1,2: Declaration of two variables a and b. The compiler reserves two locations in memory. The size of the location depends on the type of variable. Two properties are important for these locations: the address in memory and the content in the location. The value of a is a. The address of a is &a. The value of b is *b. The address of b is &b.
Line 3: The value of a is now 10.
Line 4: Memory to store 10 integers is reserved. The address to the first location is stored in b. Address to element number 6 is given by the expression (b + 6).
Line 5: All 10 elements of b are given values: b[0] = 0, b[1] = 1, ....., b[9] = 9
line 7: here we deallocate the variable b.

Call by value and reference

Line 6: The main() function calls the function func() and the program counter transfers to the first statement in func(). With respect to data the following happens. The content of a (= 10) and the content of b (a memory address) are copied to a stack (new memory location) associated with the function func()
Line 7: The variable x and y are local variables in func(). They have the values x = 10, y is the address of the first element in b in main().
Line 8: The local variable x stored in the stack memory is changed to 17. Nothing happens with the value a in main().

Call by value and reference

Line 9: The value of y is an address and the symbol *y means the position in memory which has this address. The value in this location is now increased by 10. This means that the value of b[0] in the main program is equal to 10. Thus func() has modified a value in main().
Line 10: This statement has the same effect as line 9 except that it modifies the element b[6] in main() by adding a value of 10 to what was there originally, namely 5.
Line 11: The program counter returns to main(), the next expression after func(a,b). All data on the stack associated with func() are destroyed.

Call by value and reference

The value of a is transferred to func() and stored in a new memory location called x. Any modification of x in func() does not affect in any way the value of a in main(). This is called transfer of data by value.
On the other hand the next argument in func() is an address which is transferred to func(). This address can be used to modify the corresponding value in main(). In the C language it is expressed as a modification of the value which y points to, namely the first element of b. This is called transfer of data by reference and is a method to transfer data back to the calling function, in this case main().

Call by value and reference

C++ allows however the programmer to use solely call by reference (note that call by reference is implemented as pointers). To see the difference between C and C++, consider the following simple examples. In C we would write

   int n; n =8;
   func(&n); /* &n is a pointer to n */
   ....
   void func(int *i)
   {
     *i = 10; /* n is changed to 10 */
     ....
   }

whereas in C++ we would write

   int n; n =8;
   func(n); // just transfer n itself
   ....
   void func(int& i)
   {
     i = 10; // n is changed to 10
     ....
   }

The reason why we emphasize the difference between call by value and call by reference is that it allows the programmer to avoid pitfalls like unwanted changes of variables. However, many people feel that this reduces the readability of the code.

Call by value and reference, F90/95

In Fortran we can use INTENT(IN), INTENT(OUT), INTENT(INOUT) to let the program know which values should or should not be changed.

SUBROUTINE coulomb_integral(np,lp,n,l,coulomb)
  USE effective_interaction_declar
  USE energy_variables
  USE wave_functions
  IMPLICIT NONE
  INTEGER, INTENT(IN)  :: n, l, np, lp
  INTEGER :: i
  REAL(KIND=8), INTENT(INOUT) :: coulomb
  REAL(KIND=8) :: z_rel, oscl_r, sum_coulomb
  ...

This hinders unwanted changes and increases readability.

Important Matrix and vector handling packages

The Numerical Recipes codes have been rewritten in Fortran 90/95 and C/C++ by us. The original source codes are taken from the widely used software package LAPACK, which follows two other popular packages developed in the 1970s, namely EISPACK and LINPACK.

LINPACK: package for linear equations and least square problems.
LAPACK:package for solving symmetric, unsymmetric and generalized eigenvalue problems. From LAPACK's website http://www.netlib.org it is possible to download for free all source codes from this library. Both C/C++ and Fortran versions are available.
BLAS (I, II and III): (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. Blas I is vector operations, II vector-matrix operations and III matrix-matrix operations. Highly parallelized and efficient codes, all available for download from http://www.netlib.org.

Basic Matrix Features

Matrix properties reminder.

$$ {\bf A} = \left( \begin{array}{cccc} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{array} \right)\qquad {\bf I} = \left( \begin{array}{cccc} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{array} \right) $$

The inverse of a matrix is defined by $$ {\bf A}^{-1} \cdot {\bf A} = I $$

Basic Matrix Features

Matrix Properties Reminder.

Relations	Name	matrix elements
$ A = A^{T} $	symmetric	$ a_{ij} = a_{ji} $
$ A = \left (A^{T} \right )^{-1} $	real orthogonal	$ \sum_k a_{ik} a_{jk} = \sum_k a_{ki} a_{kj} = \delta_{ij} $
$ A = A^{*} $	real matrix	$ a_{ij} = a_{ij}^{*} $
$ A = A^{\dagger} $	hermitian	$ a_{ij} = a_{ji}^{*} $
$ A = \left (A^{\dagger} \right )^{-1} $	unitary	$ \sum_k a_{ik} a_{jk}^{} = \sum_k a_{ki}^{} a_{kj} = \delta_{ij} $

Some famous Matrices

Diagonal if $ a_{ij}=0 $ for $ i\ne j $
Upper triangular if $ a_{ij}=0 $ for $ i > j $
Lower triangular if $ a_{ij}=0 $ for $ i < j $
Upper Hessenberg if $ a_{ij}=0 $ for $ i > j+1 $
Lower Hessenberg if $ a_{ij}=0 $ for $ i < j+1 $
Tridiagonal if $ a_{ij}=0 $ for $ |i -j| > 1 $
Lower banded with bandwidth $ p $: $ a_{ij}=0 $ for $ i > j+p $
Upper banded with bandwidth $ p $: $ a_{ij}=0 $ for $ i < j+p $
Banded, block upper triangular, block lower triangular....

Basic Matrix Features

Some Equivalent Statements.

For an $ N\times N $ matrix $ {\bf A} $ the following properties are all equivalent

If the inverse of $ {\bf A} $ exists, $ {\bf A} $ is nonsingular.
The equation $ {\bf Ax}=0 $ implies $ {\bf x}=0 $.
The rows of $ {\bf A} $ form a basis of $ R^N $.
The columns of $ {\bf A} $ form a basis of $ R^N $.
$ {\bf A} $ is a product of elementary matrices.
$ 0 $ is not eigenvalue of $ {\bf A} $.

Important Mathematical Operations

The basic matrix operations that we will deal with are addition and subtraction $$ \begin{equation} {\bf A}= {\bf B}\pm{\bf C} \Longrightarrow a_{ij} = b_{ij}\pm c_{ij}, \label{eq:mtxadd} \end{equation} $$ scalar-matrix multiplication $$ \begin{equation} {\bf A}= \gamma{\bf B} \Longrightarrow a_{ij} = \gamma b_{ij}, \end{equation} $$ vector-matrix multiplication $$ \begin{equation} {\bf y}={\bf Ax} \Longrightarrow y_{i} = \sum_{j=1}^{n} a_{ij}x_j, \label{eq:vecmtx} \end{equation} $$ matrix-matrix multiplication $$ \begin{equation} {\bf A}={\bf BC} \Longrightarrow a_{ij} = \sum_{k=1}^{n} b_{ik}c_{kj}, \label{eq:mtxmtx} \end{equation} $$ and transposition $$ \begin{equation} {\bf A}={\bf B}^T \Longrightarrow a_{ij} = b_{ji} \end{equation} $$

Important Mathematical Operations

Similarly, important vector operations that we will deal with are addition and subtraction $$ \begin{equation} {\bf x}= {\bf y}\pm{\bf z} \Longrightarrow x_{i} = y_{i}\pm z_{i}, \end{equation} $$ scalar-vector multiplication $$ \begin{equation} {\bf x}= \gamma{\bf y} \Longrightarrow x_{i} = \gamma y_{i}, \end{equation} $$ vector-vector multiplication (called Hadamard multiplication) $$ \begin{equation} {\bf x}={\bf yz} \Longrightarrow x_{i} = y_{i}z_i, \end{equation} $$ the inner or so-called dot product resulting in a constant $$ \begin{equation} x={\bf y}^T{\bf z} \Longrightarrow x = \sum_{j=1}^{n} y_{j}z_{j}, \label{eq:innerprod} \end{equation} $$ and the outer product, which yields a matrix, $$ \begin{equation} {\bf A}= {\bf yz}^T \Longrightarrow a_{ij} = y_{i}z_{j}, \label{eq:outerprod} \end{equation} $$

Matrix Handling in C/C++, Static and Dynamical allocation

Static.

We have an $ N\times N $ matrix A with $ N=100 $ In C/C++ this would be defined as

   int N = 100;
   double A[100][100];
   //   initialize all elements to zero
   for(i=0 ; i < N ; i++) {
      for(j=0 ; j < N ; j++) {
         A[i][j] = 0.0;

Note the way the matrix is organized, row-major order.

Matrix Handling in C/C++

Row Major Order, Addition.

We have $ N\times N $ matrices A, B and C and we wish to evaluate $ A=B+C $. $$ {\bf A}= {\bf B}\pm{\bf C} \Longrightarrow a_{ij} = b_{ij}\pm c_{ij}, $$ In C/C++ this would be coded like

   for(i=0 ; i < N ; i++) {
      for(j=0 ; j < N ; j++) {
         a[i][j] = b[i][j]+c[i][j]

Matrix Handling in C/C++

Row Major Order, Multiplication.

We have $ N\times N $ matrices A, B and C and we wish to evaluate $ A=BC $. $$ {\bf A}={\bf BC} \Longrightarrow a_{ij} = \sum_{k=1}^{n} b_{ik}c_{kj}, $$ In C/C++ this would be coded like

   for(i=0 ; i < N ; i++) {
      for(j=0 ; j < N ; j++) {
         for(k=0 ; k < N ; k++) {
            a[i][j]+=b[i][k]*c[k][j];

Matrix Handling in Fortran 90/95

Column Major Order.

   ALLOCATE (a(N,N), b(N,N), c(N,N))
   DO j=1,  N
      DO i=1, N
         a(i,j)=b(i,j)+c(i,j)
      ENDDO
   ENDDO
   ...
   DEALLOCATE(a,b,c)

Fortran 90 writes the above statements in a much simpler way

   a=b+c

Multiplication

   a=MATMUL(b,c)

Fortran contains also the intrinsic functions TRANSPOSE and CONJUGATE.

Dynamic memory allocation in C/C++

At least three possibilities in this course

Do it yourself
Use the functions provided in the library package lib.cpp
Use Armadillo http://arma.sourceforgenet (a C++ linear algebra library, discussion next two weeks, both here and at lab). !split

Matrix Handling in C/C++, Dynamic Allocation

Do it yourself.

int N;
double **  A;
A = new double*[N]
for ( i = 0; i < N; i++)
    A[i] = new double[N];

Always free space when you don't need an array anymore.

for ( i = 0; i < N; i++)
    delete[] A[i];
delete[] A;

Armadillo, recommended!!

Armadillo is a C++ linear algebra library (matrix maths) aiming towards a good balance between speed and ease of use. The syntax is deliberately similar to Matlab.
Integer, floating point and complex numbers are supported, as well as a subset of trigonometric and statistics functions. Various matrix decompositions are provided through optional integration with LAPACK, or one of its high performance drop-in replacements (such as the multi-threaded MKL or ACML libraries).
A delayed evaluation approach is employed (at compile-time) to combine several operations into one and reduce (or eliminate) the need for temporaries. This is accomplished through recursive templates and template meta-programming.
Useful for conversion of research code into production environments, or if C++ has been decided as the language of choice, due to speed and/or integration capabilities.
The library is open-source software, and is distributed under a license that is useful in both open-source and commercial/proprietary contexts.

Armadillo, simple examples

#include <iostream>
#include <armadillo>

using namespace std;
using namespace arma;

int main(int argc, char** argv)
  {
  mat A = randu<mat>(5,5);
  mat B = randu<mat>(5,5);

  cout << A*B << endl;

  return 0;

Armadillo, how to compile and install

For people using Ubuntu, Debian, Linux Mint, simply go to the synaptic package manager and install armadillo from there. You may have to install Lapack as well. For Mac and Windows users, follow the instructions from the webpage http://arma.sourceforge.net. To compile, use for example

c++ -O2 -o program.x program.cpp  -larmadillo -llapack -lblas

where the -l option indicates the library you wish to link to.

Armadillo, simple examples

#include <iostream>
#include "armadillo"
using namespace arma;
using namespace std;

int main(int argc, char** argv)
  {
  // directly specify the matrix size (elements are uninitialised)
  mat A(2,3);
  // .n_rows = number of rows    (read only)
  // .n_cols = number of columns (read only)
  cout << "A.n_rows = " << A.n_rows << endl;
  cout << "A.n_cols = " << A.n_cols << endl;
  // directly access an element (indexing starts at 0)
  A(1,2) = 456.0;
  A.print("A:");
  // scalars are treated as a 1x1 matrix,
  // hence the code below will set A to have a size of 1x1
  A = 5.0;
  A.print("A:");
  // if you want a matrix with all elements set to a particular value
  // the .fill() member function can be used
  A.set_size(3,3);
  A.fill(5.0);  A.print("A:");

Armadillo, simple examples

  mat B;

  // endr indicates "end of row"
  B << 0.555950 << 0.274690 << 0.540605 << 0.798938 << endr
    << 0.108929 << 0.830123 << 0.891726 << 0.895283 << endr
    << 0.948014 << 0.973234 << 0.216504 << 0.883152 << endr
    << 0.023787 << 0.675382 << 0.231751 << 0.450332 << endr;

  // print to the cout stream
  // with an optional string before the contents of the matrix
  B.print("B:");

  // the << operator can also be used to print the matrix
  // to an arbitrary stream (cout in this case)
  cout << "B:" << endl << B << endl;
  // save to disk
  B.save("B.txt", raw_ascii);
  // load from disk
  mat C;
  C.load("B.txt");
  C += 2.0 * B;
  C.print("C:");

Armadillo, simple examples

  // submatrix types:
  //
  // .submat(first_row, first_column, last_row, last_column)
  // .row(row_number)
  // .col(column_number)
  // .cols(first_column, last_column)
  // .rows(first_row, last_row)

  cout << "C.submat(0,0,3,1) =" << endl;
  cout << C.submat(0,0,3,1) << endl;

  // generate the identity matrix
  mat D = eye<mat>(4,4);

  D.submat(0,0,3,1) = C.cols(1,2);
  D.print("D:");

  // transpose
  cout << "trans(B) =" << endl;
  cout << trans(B) << endl;

  // maximum from each column (traverse along rows)
  cout << "max(B) =" << endl;
  cout << max(B) << endl;

Armadillo, simple examples

  // maximum from each row (traverse along columns)
  cout << "max(B,1) =" << endl;
  cout << max(B,1) << endl;
  // maximum value in B
  cout << "max(max(B)) = " << max(max(B)) << endl;
  // sum of each column (traverse along rows)
  cout << "sum(B) =" << endl;
  cout << sum(B) << endl;
  // sum of each row (traverse along columns)
  cout << "sum(B,1) =" << endl;
  cout << sum(B,1) << endl;
  // sum of all elements
  cout << "sum(sum(B)) = " << sum(sum(B)) << endl;
  cout << "accu(B)     = " << accu(B) << endl;
  // trace = sum along diagonal
  cout << "trace(B)    = " << trace(B) << endl;
  // random matrix -- values are uniformly distributed in the [0,1] interval
  mat E = randu<mat>(4,4);
  E.print("E:");

Armadillo, simple examples

  // row vectors are treated like a matrix with one row
  rowvec r;
  r << 0.59499 << 0.88807 << 0.88532 << 0.19968;
  r.print("r:");

  // column vectors are treated like a matrix with one column
  colvec q;
  q << 0.81114 << 0.06256 << 0.95989 << 0.73628;
  q.print("q:");

  // dot or inner product
  cout << "as_scalar(r*q) = " << as_scalar(r*q) << endl;

    // outer product
  cout << "q*r =" << endl;
  cout << q*r << endl;


  // sum of three matrices (no temporary matrices are created)
  mat F = B + C + D;
  F.print("F:");

    return 0;

Armadillo, simple examples

#include <iostream>
#include "armadillo"
using namespace arma;
using namespace std;

int main(int argc, char** argv)
  {
  cout << "Armadillo version: " << arma_version::as_string() << endl;

  mat A;

  A << 0.165300 << 0.454037 << 0.995795 << 0.124098 << 0.047084 << endr
    << 0.688782 << 0.036549 << 0.552848 << 0.937664 << 0.866401 << endr
    << 0.348740 << 0.479388 << 0.506228 << 0.145673 << 0.491547 << endr
    << 0.148678 << 0.682258 << 0.571154 << 0.874724 << 0.444632 << endr
    << 0.245726 << 0.595218 << 0.409327 << 0.367827 << 0.385736 << endr;

  A.print("A =");

  // determinant
  cout << "det(A) = " << det(A) << endl;

Armadillo, simple examples

  // inverse
  cout << "inv(A) = " << endl << inv(A) << endl;
  double k = 1.23;

  mat    B = randu<mat>(5,5);
  mat    C = randu<mat>(5,5);

  rowvec r = randu<rowvec>(5);
  colvec q = randu<colvec>(5);


  // examples of some expressions
  // for which optimised implementations exist
  // optimised implementation of a trinary expression
  // that results in a scalar
  cout << "as_scalar( r*inv(diagmat(B))*q ) = ";
  cout << as_scalar( r*inv(diagmat(B))*q ) << endl;

  // example of an expression which is optimised
  // as a call to the dgemm() function in BLAS:
  cout << "k*trans(B)*C = " << endl << k*trans(B)*C;

    return 0;

Gaussian Elimination

We start with the linear set of equations $$ {\bf A}{\bf x} = {\bf w}. $$ We assume also that the matrix $ {\bf A} $ is non-singular and that the matrix elements along the diagonal satisfy $ a_{ii} \ne 0 $. Simple $ 4\times 4 $ example $$ \left(\begin{array}{cccc} a_{11}& a_{12} &a_{13}& a_{14}\\ a_{21}& a_{22} &a_{23}& a_{24}\\ a_{31}& a_{32} &a_{33}& a_{34}\\ a_{41}& a_{42} &a_{43}& a_{44}\\ \end{array} \right)\left(\begin{array}{c} x_1\\ x_2\\ x_3 \\ x_4 \\ \end{array} \right) =\left(\begin{array}{c} w_1\\ w_2\\ w_3 \\ w_4\\ \end{array} \right). $$ or $$ \begin{align} a_{11}x_1 +a_{12}x_2 +a_{13}x_3 + a_{14}x_4=&w_1 \nonumber \\ a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + a_{24}x_4=&w_2 \nonumber \\ a_{31}x_1 + a_{32}x_2 + a_{33}x_3 + a_{34}x_4=&w_3 \nonumber \\ a_{41}x_1 + a_{42}x_2 + a_{43}x_3 + a_{44}x_4=&w_4. \nonumber \end{align} $$

Gaussian Elimination

The basic idea of Gaussian elimination is to use the first equation to eliminate the first unknown $ x_1 $ from the remaining $ n-1 $ equations. Then we use the new second equation to eliminate the second unknown $ x_2 $ from the remaining $ n-2 $ equations. With $ n-1 $ such eliminations we obtain a so-called upper triangular set of equations of the form $$ \begin{align}label{eq:gaussbacksub} b_{11}x_1 +b_{12}x_2 +b_{13}x_3 + b_{14}x_4=&y_1 \nonumber \\ b_{22}x_2 + b_{23}x_3 + b_{24}x_4=&y_2 \nonumber \\ b_{33}x_3 + b_{34}x_4=&y_3 \nonumber \\ b_{44}x_4=&y_4. \nonumber \end{align} $$ We can solve this system of equations recursively starting from $ x_n $ (in our case $ x_4 $) and proceed with what is called a backward substitution. This process can be expressed mathematically as $$ \begin{equation} x_m = \frac{1}{b_{mm}}\left(y_m-\sum_{k=m+1}^nb_{mk}x_k\right)\quad m=n-1,n-2,\dots,1. \end{equation} $$ To arrive at such an upper triangular system of equations, we start by eliminating the unknown $ x_1 $ for $ j=2,n $. We achieve this by multiplying the first equation by $ a_{j1}/a_{11} $ and then subtract the result from the $j$th equation. We assume obviously that $ a_{11}\ne 0 $ and that $ {\bf A} $ is not singular.

Gaussian Elimination

Our actual $ 4\times 4 $ example reads after the first operation $$ \left(\begin{array}{cccc} a_{11}& a_{12} &a_{13}& a_{14}\\ 0& (a_{22}-\frac{a_{21}a_{12}}{a_{11}}) &(a_{23}-\frac{a_{21}a_{13}}{a_{11}}) & (a_{24}-\frac{a_{21}a_{14}}{a_{11}})\\ 0& (a_{32}-\frac{a_{31}a_{12}}{a_{11}})& (a_{33}-\frac{a_{31}a_{13}}{a_{11}})& (a_{34}-\frac{a_{31}a_{14}}{a_{11}})\\ 0&(a_{42}-\frac{a_{41}a_{12}}{a_{11}}) &(a_{43}-\frac{a_{41}a_{13}}{a_{11}}) & (a_{44}-\frac{a_{41}a_{14}}{a_{11}}) \\ \end{array} \right)\left(\begin{array}{c} x_1\\ x_2\\ x_3 \\ x_4 \\ \end{array} \right) =\left(\begin{array}{c} y_1\\ w_2^{(2)}\\ w_3^{(2)} \\ w_4^{(2)}\\ \end{array} \right), $$ or $$ \begin{align} b_{11}x_1 +b_{12}x_2 +b_{13}x_3 + b_{14}x_4=&y_1 \nonumber \\ a^{(2)}_{22}x_2 + a^{(2)}_{23}x_3 + a^{(2)}_{24}x_4=&w^{(2)}_2 \nonumber \\ a^{(2)}_{32}x_2 + a^{(2)}_{33}x_3 + a^{(2)}_{34}x_4=&w^{(2)}_3 \nonumber \\ a^{(2)}_{42}x_2 + a^{(2)}_{43}x_3 + a^{(2)}_{44}x_4=&w^{(2)}_4, \nonumber \\ \end{align} $$

Gaussian Elimination

The new coefficients are $$ \begin{equation} b_{1k} = a_{1k}^{(1)} \quad k=1,\dots,n, \end{equation} $$ where each $ a_{1k}^{(1)} $ is equal to the original $ a_{1k} $ element. The other coefficients are $$ \begin{equation} a_{jk}^{(2)} = a_{jk}^{(1)}-\frac{a_{j1}^{(1)}a_{1k}^{(1)}}{a_{11}^{(1)}} \quad j,k=2,\dots,n, \end{equation} $$ with a new right-hand side given by $$ \begin{equation} y_{1}=w_1^{(1)}, \quad w_j^{(2)} =w_j^{(1)}-\frac{a_{j1}^{(1)}w_1^{(1)}}{a_{11}^{(1)}} \quad j=2,\dots,n. \end{equation} $$ We have also set $ w_1^{(1)}=w_1 $, the original vector element. We see that the system of unknowns $ x_1,\dots,x_n $ is transformed into an $ (n-1)\times (n-1) $ problem.

Gaussian Elimination

This step is called forward substitution. Proceeding with these substitutions, we obtain the general expressions for the new coefficients $$ \begin{equation} a_{jk}^{(m+1)} = a_{jk}^{(m)}-\frac{a_{jm}^{(m)}a_{mk}^{(m)}}{a_{mm}^{(m)}} \quad j,k=m+1,\dots,n, \end{equation} $$ with $ m=1,\dots,n-1 $ and a right-hand side given by $$ \begin{equation} w_j^{(m+1)} =w_j^{(m)}-\frac{a_{jm}^{(m)}w_m^{(m)}}{a_{mm}^{(m)}}\quad j=m+1,\dots,n. \end{equation} $$ This set of $ n-1 $ elimations leads us to an equations which is solved by back substitution. If the arithmetics is exact and the matrix $ {\bf A} $ is not singular, then the computed answer will be exact.

Even though the matrix elements along the diagonal are not zero, numerically small numbers may appear and subsequent divisions may lead to large numbers, which, if added to a small number may yield losses of precision. Suppose for example that our first division in $ (a_{22}-a_{21}a_{12}/a_{11}) $ results in $ -10^{-7} $ and that $ a_{22} $ is one. one. We are then adding $ 10^7+1 $. With single precision this results in $ 10^7 $.

Gaussian Elimination and Tridiagonal matrices, project 1

Suppose we want to solve the following boundary value equation $$ -\frac{d^2u(x)}{dx^2} = f(x,u(x)), $$ with $ x\in (a,b) $ and with boundary conditions $ u(a)=u(b) = 0 $. We assume that $ f $ is a continuous function in the domain $ x\in (a,b) $. Since, except the few cases where it is possible to find analytic solutions, we will seek after approximate solutions, we choose to represent the approximation to the second derivative from the previous chapter $$ f''=\frac{f_h -2f_0 +f_{-h}}{h^2} +O(h^2). $$ We subdivide our interval $ x\in (a,b) $ into $ n $ subintervals by setting $ x_i = ih $, with $ i=0,1,\dots,n+1 $. The step size is then given by $ h=(b-a)/(n+1) $ with $ n\in {\mathbb{N}} $. For the internal grid points $ i=1,2,\dots n $ we replace the differential operator with the above formula resulting in $$ u''(x_i) \approx \frac{u(x_i+h) -2u(x_i) +u(x_i-h)}{h^2}, $$ which we rewrite as $$ u^{''}_i \approx \frac{u_{i+1} -2u_i +u_{i-i}}{h^2}. $$

Gaussian Elimination and Tridiagonal matrices, project 1

We can rewrite our original differential equation in terms of a discretized equation with approximations to the derivatives as $$ -\frac{u_{i+1} -2u_i +u_{i-i}}{h^2}=f(x_i,u(x_i)), $$ with $ i=1,2,\dots, n $. We need to add to this system the two boundary conditions $ u(a) =u_0 $ and $ u(b) = u_{n+1} $. If we define a matrix $$ {\bf A} = \frac{1}{h^2}\left(\begin{array}{cccccc} 2 & -1 & & & & \\ -1 & 2 & -1 & & & \\ & -1 & 2 & -1 & & \\ & \dots & \dots &\dots &\dots & \dots \\ & & &-1 &2& -1 \\ & & & &-1 & 2 \\ \end{array} \right) $$ and the corresponding vectors $ {\bf u} = (u_1, u_2, \dots,u_n)^T $ and $ {\bf f}({\bf u}) = f(x_1,x_2,\dots, x_n,u_1, u_2, \dots,u_n)^T $ we can rewrite the differential equation including the boundary conditions as a system of linear equations with a large number of unknowns $$ {\bf A}{\bf u} = {\bf f}({\bf u}). $$

Gaussian Elimination and Tridiagonal matrices, project 1

We start with the linear set of equations $$ {\bf A}{\bf u} = {\bf f}, $$ where $ {\bf A} $ is a tridiagonal matrix which we rewrite as $$ {\bf A} = \left(\begin{array}{cccccc} b_1& c_1 & 0 &\dots & \dots &\dots \\ a_2 & b_2 & c_2 &\dots &\dots &\dots \\ & a_3 & b_3 & c_3 & \dots & \dots \\ & \dots & \dots &\dots &\dots & \dots \\ & & &a_{n-2} &b_{n-1}& c_{n-1} \\ & & & &a_n & b_n \\ \end{array} \right) $$ where $ a,b,c $ are one-dimensional arrays of length $ 1:n $. In project 1 the arrays $ a $ and $ c $ are equal, namely $ a_i=c_i=-1/h^2 $. The matrix is also positive definite.

Gaussian Elimination and Tridiagonal matrices, project 1

We can rewrite as $$ {\bf A} = \left(\begin{array}{cccccc} b_1& c_1 & 0 &\dots & \dots &\dots \\ a_2 & b_2 & c_2 &\dots &\dots &\dots \\ & a_3 & b_3 & c_3 & \dots & \dots \\ & \dots & \dots &\dots &\dots & \dots \\ & & &a_{n-2} &b_{n-1}& c_{n-1} \\ & & & &a_n & b_n \\ \end{array} \right)\left(\begin{array}{c} u_1\\ u_2\\ \dots \\ \dots \\ \dots \\ u_n\\ \end{array} \right) =\left(\begin{array}{c} f_1\\ f_2\\ \dots \\ \dots \\ \dots \\ f_n\\ \end{array} \right). $$

Gaussian Elimination and Tridiagonal matrices, project 1

A tridiagonal matrix is a special form of banded matrix where all the elements are zero except for those on and immediately above and below the leading diagonal. The above tridiagonal system can be written as $$ a_iu_{i-1}+b_iu_i+c_iu_{i+1} = f_i, $$ for $ i=1,2,\dots,n $. We see that $ u_{-1} $ and $ u_{n+1} $ are not required and we can set $ a_1=c_n=0 $. In many applications the matrix is symmetric and we have $ a_i=c_i $. The algorithm for solving this set of equations is rather simple and requires two steps only, a forward substitution and a backward substitution. These steps are also common to the algorithms based on Gaussian elimination that we discussed previously. However, due to its simplicity, the number of floating point operations is in this case proportional with $ O(n) $ while Gaussian elimination requires $ 2n^3/3+O(n^2) $ floating point operations.

Gaussian Elimination and Tridiagonal matrices, project 1

In case your system of equations leads to a tridiagonal matrix, it is clearly an overkill to employ Gaussian elimination or the standard LU decomposition. You will encounter several applications involving tridiagonal matrices in our discussion of partial differential equations in chapter 10.

Our algorithm starts with forward substitution with a loop over of the elements $ i $ and can be expressed via the following piece of code

   btemp = b[1];
   u[1] = f[1]/btemp;
   for(i=2 ; i <= n ; i++) {
      temp[i] = c[i-1]/btemp;
      btemp = b[i]-a[i]*temp[i];
      u[i] = (f[i] - a[i]*u[i-1])/btemp;

Gaussian Elimination and Tridiagonal matrices, project 1

Note that you should avoid cases with $ b_1=0 $. If that is the case, you should rewrite the equations as a set of order $ n-1 $ with $ u_2 $ eliminated. Finally we perform the backsubstitution leading to the following code

   for(i=n-1 ; i >= 1 ; i--) {
      u[i] -= temp[i+1]*u[i+1];

Gaussian Elimination and Tridiagonal matrices, project 1

Note that our sums start with $ i=1 $ and that one should avoid cases with $ b_1=0 $. If that is the case, you should rewrite the equations as a set of order $ n-1 $ with $ u_2 $ eliminated. However, a tridiagonal matrix problem is not a guarantee that we can find a solution. The matrix $ {\bf A} $ which rephrases a second derivative in a discretized form $$ {\bf A} = \left(\begin{array}{cccccc} 2 & -1 & 0 & 0 &0 & 0\\ -1 & 2 & -1 &0 &0 &0 \\ 0 & -1 & 2 & -1 & 0& 0 \\ 0 & \dots & \dots & \dots &\dots & \dots \\ 0 &0 &0 &-1 &2& -1 \\ 0 & 0 &0 &0 &-1 & 2 \\ \end{array} \right), $$ fulfills the condition of a weak dominance of the diagonal, with $ |b_1| > |c_1| $, $ |b_n| > |a_n| $ and $ |b_k| \ge |a_k|+|c_k| $ for $ k=2,3,\dots,n-1 $. This is a relevant but not sufficient condition to guarantee that the matrix $ {\bf A} $ yields a solution to a linear equation problem. The matrix needs also to be irreducible. A tridiagonal irreducible matrix means that all the elements $ a_i $ and $ c_i $ are non-zero. If these two conditions are present, then $ {\bf A} $ is nonsingular and has a unique LU decomposition.

Project 1, hints

When setting up the algo it is useful to note that the different operations on the matrix (here as a $ 4\times 4 $ case with diagonals $ d_i $ and off-diagonals $ e_i $ $$ \left(\begin{array}{cccc} d_1 & e_1 & 0 & 0 \\ e_1 & d_2 & e_2 & 0 \\ 0 & e_2 & d_3 & e_3 \\ 0 & 0 & e_3 & d_4 \end{array} \right)\rightarrow \left(\begin{array}{cccc} d_1 & e_1 & 0 & 0 \\ 0 & \tilde{d}_2 & e_2 & 0 \\ 0 & e_2 & d_3 & e_3 \\ 0 & 0 & e_3 & d_4 \end{array} \right)\rightarrow \left(\begin{array}{cccc} d_1 & e_1 & 0 & 0 \\ 0 & \tilde{d}_2 & e_2 & 0 \\ 0 & 0 & \tilde{d}_3 & e_3 \\ 0 & 0 & e_3 & d_4 \end{array} \right) $$ and finally $$ \left(\begin{array}{cccc} d_1 & e_1 & 0 & 0 \\ 0 & \tilde{d}_2 & e_2 & 0 \\ 0 & 0 & \tilde{d}_3 & e_3 \\ 0 & 0 & 0 & \tilde{d}_4 \end{array} \right) $$

Project 1, hints

We notice the sub-blocks which get repeated $$ \left(\begin{array}{cccc} d_1 & e_1 & 0 & 0 \\ 0 & \tilde{d}_2 & e_2 & 0 \\ 0 & 0 & \tilde{d}_3 & e_3 \\ 0 & 0 & 0 & \tilde{d}_4 \end{array} \right) $$ The matrices we often end up with in rewriting for for example partial differential equations, have the feature that all leading principal submatrices are non-singular. If the matrix is symmetric as well it can be rewritten as $ A=LDL^T $ with $ D $ the diagonal and we have the following relations $ a_{11} = d_1 $, $ a_{k,k-1}=e_{k-1}d_{k-1} $ for $ k=2,\dots,n $ and finally $$ a_{kk} = d_k+e_{k-1}^2d_{k-1}=d_k+e_{k-1}a_{k,k-1} $$ for $ k=2,\dots,n $.

Linear Algebra Methods

Gaussian elimination, $ O(2/3n^3) $ flops, general matrix
LU decomposition, upper triangular and lower tridiagonal matrices, $ O(2/3n^3) $ flops, general matrix. Get easily the inverse, determinant and can solve linear equations with back-substitution only, $ O(n^2) $ flops
Cholesky decomposition $ A=LL^T $. Real symmetric or hermitian positive definite matrix, $ O(1/3n^3) $ flops.
Tridiagonal linear systems, important for differential equations. Normally positive definite and non-singular. $ O(8n) $ flops for symmetric. $ A=LDL^T $ with $ D $ the diagonal. Special case of banded matrices.
Singular value decomposition
the QR method will be discussed in chapter 7 in connection with eigenvalue systems. $ O(4/3n^3) $ flops. !split

LU Decomposition

The LU decomposition method means that we can rewrite this matrix as the product of two matrices $ {\bf L} $ and $ {\bf U} $ where $$ \label{eq3} \left(\begin{array}{cccc} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{array} \right) = \left( \begin{array}{cccc} 1 & 0 & 0 & 0 \\ l_{21} & 1 & 0 & 0 \\ l_{31} & l_{32} & 1 & 0 \\ l_{41} & l_{42} & l_{43} & 1 \end{array} \right) \left( \begin{array}{cccc} u_{11} & u_{12} & u_{13} & u_{14} \\ 0 & u_{22} & u_{23} & u_{24} \\ 0 & 0 & u_{33} & u_{34} \\ 0 & 0 & 0 & u_{44} \end{array} \right). $$ LU decomposition forms the backbone of other algorithms in linear algebra, such as the solution of linear equations given by $$ \begin{align} a_{11}x_1 +a_{12}x_2 +a_{13}x_3 + a_{14}x_4=&w_1 \nonumber \\ a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + a_{24}x_4=&w_2 \nonumber \\ a_{31}x_1 + a_{32}x_2 + a_{33}x_3 + a_{34}x_4=&w_3 \nonumber \\ a_{41}x_1 + a_{42}x_2 + a_{43}x_3 + a_{44}x_4=&w_4. \nonumber \end{align} $$ The above set of equations is conveniently solved by using LU decomposition as an intermediate step.

The matrix $ {\bf A}\in \mathbb{R}^{n\times n} $ has an LU factorization if the determinant is different from zero. If the LU factorization exists and $ {\bf A} $ is non-singular, then the LU factorization is unique and the determinant is given by $$ det\{{\bf A}\}=det\{{\bf LU}\}= det\{{\bf L}\}det\{{\bf U}\}=u_{11}u_{22}\dots u_{nn}. $$

LU Decomposition, why?

There are at least three main advantages with LU decomposition compared with standard Gaussian elimination:

It is straightforward to compute the determinant of a matrix
If we have to solve sets of linear equations with the same matrix but with different vectors $ {\bf y} $, the number of FLOPS is of the order $ n^3 $.
The invers is such an operation !split

LU Decomposition, linear equations

With the LU decomposition it is rather simple to solve a system of linear equations $$ \begin{align} a_{11}x_1 +a_{12}x_2 +a_{13}x_3 + a_{14}x_4=&w_1 \nonumber \\ a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + a_{24}x_4=&w_2 \nonumber \\ a_{31}x_1 + a_{32}x_2 + a_{33}x_3 + a_{34}x_4=&w_3 \nonumber \\ a_{41}x_1 + a_{42}x_2 + a_{43}x_3 + a_{44}x_4=&w_4. \nonumber \end{align} $$

This can be written in matrix form as $$ {\bf Ax}={\bf w}. $$

where $ {\bf A} $ and $ {\bf w} $ are known and we have to solve for $ {\bf x} $. Using the LU dcomposition we write $$ {\bf A} {\bf x} \equiv {\bf L} {\bf U} {\bf x} ={\bf w}. $$

LU Decomposition, linear equations

The previous equation can be calculated in two steps $$ {\bf L} {\bf y} = {\bf w};\qquad {\bf Ux}={\bf y}. $$

To show that this is correct we use to the LU decomposition to rewrite our system of linear equations as $$ {\bf LUx}={\bf w}, $$ and since the determinat of $ {\bf L} $ is equal to 1 (by construction since the diagonals of $ {\bf L} $ equal 1) we can use the inverse of $ {\bf L} $ to obtain $$ {\bf Ux}={\bf L^{-1}w}={\bf y}, $$ which yields the intermediate step $$ {\bf L^{-1}w}={\bf y} $$ and as soon as we have $ {\bf y} $ we can obtain $ {\bf x} $ through $ {\bf Ux}={\bf y} $.

LU Decomposition, why?

For our four-dimentional example this takes the form $$ \begin{align} y_1=&w_1 \nonumber\\ l_{21}y_1 + y_2=&w_2\nonumber \\ l_{31}y_1 + l_{32}y_2 + y_3 =&w_3\nonumber \\ l_{41}y_1 + l_{42}y_2 + l_{43}y_3 + y_4=&w_4. \nonumber \end{align} $$

and $$ \begin{align} u_{11}x_1 +u_{12}x_2 +u_{13}x_3 + u_{14}x_4=&y_1 \nonumber\\ u_{22}x_2 + u_{23}x_3 + u_{24}x_4=&y_2\nonumber \\ u_{33}x_3 + u_{34}x_4=&y_3\nonumber \\ u_{44}x_4=&y_4 \nonumber \end{align} $$

This example shows the basis for the algorithm needed to solve the set of $ n $ linear equations.

LU Decomposition, linear equations

The algorithm goes as follows

Set up the matrix $ \bf A $ and the vector $ \bf w $ with their correct dimensions. This determines the dimensionality of the unknown vector $ \bf x $.
Then LU decompose the matrix $ \bf A $ through a call to the function ludcmp(double a, int n, int indx, double &d). This functions returns the LU decomposed matrix $ \bf A $, its determinant and the vector indx which keeps track of the number of interchanges of rows. If the determinant is zero, the solution is malconditioned.
Thereafter you call the function lubksb(double a, int n, int indx, double w) which uses the LU decomposed matrix $ \bf A $ and the vector $ \bf w $ and returns $ \bf x $ in the same place as $ \bf w $. Upon exit the original content in $ \bf w $ is destroyed. If you wish to keep this information, you should make a backup of it in your calling function.

LU Decomposition, the inverse of a matrix

If the inverse exists then $$ {\bf A}^{-1}{\bf A}={\bf I}, $$ the identity matrix. With an LU decomposed matrix we can rewrite the last equation as $$ {\bf LU}{\bf A}^{-1}={\bf I}. $$ If we assume that the first column (that is column 1) of the inverse matrix can be written as a vector with unknown entries $$ {\bf A}_1^{-1}= \left( \begin{array}{c} a_{11}^{-1} \\ a_{21}^{-1} \\ \dots \\ a_{n1}^{-1} \\ \end{array} \right), $$ then we have a linear set of equations $$ {\bf LU}\left( \begin{array}{c} a_{11}^{-1} \\ a_{21}^{-1} \\ \dots \\ a_{n1}^{-1} \\ \end{array} \right) =\left( \begin{array}{c} 1 \\ 0 \\ \dots \\ 0 \\ \end{array} \right). $$

LU Decomposition, the inverse

In a similar way we can compute the unknow entries of the second column, $$ {\bf LU}\left( \begin{array}{c} a_{12}^{-1} \\ a_{22}^{-1} \\ \dots \\ a_{n2}^{-1} \\ \end{array} \right) =\left( \begin{array}{c} 0 \\ 1 \\ \dots \\ 0 \\ \end{array} \right), $$ and continue till we have solved all $ n $ sets of linear equations.

How to use the Library functions

Standard C/C++: fetch the files lib.cpp and lib.h. You can make a directory where you store these files, and eventually its compiled version lib.o. The example here is program1.cpp from chapter 6 and performs the matrix inversion.

//  Simple matrix inversion example
#include <iostream>
#include <new>
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cstring>
#include "lib.h"

using namespace std;

/* function declarations */

void inverse(double **, int);

How to use the Library functions

void inverse(double **a, int n)
{
  int          i,j, *indx;
  double       d, *col, **y;
  // allocate space in memory
  indx = new int[n];
  col  = new double[n];
  y    = (double **) matrix(n, n, sizeof(double));
  ludcmp(a, n, indx, &d);   // LU decompose  a[][]
  printf("\n\nLU form of matrix of a[][]:\n");
  for(i = 0; i < n; i++) {
    printf("\n");
    for(j = 0; j < n; j++) {
      printf(" a[%2d][%2d] = %12.4E",i, j, a[i][j]);

How to use the Library functions

  // find inverse of a[][] by columns
  for(j = 0; j < n; j++) {
    // initialize right-side of linear equations
    for(i = 0; i < n; i++) col[i] = 0.0;
    col[j] = 1.0;
    lubksb(a, n, indx, col);
    // save result in y[][]
    for(i = 0; i < n; i++) y[i][j] = col[i];
  }   //j-loop over columns
  // return the inverse matrix in a[][]
  for(i = 0; i < n; i++) {
    for(j = 0; j < n; j++) a[i][j] = y[i][j];

  free_matrix((void **) y);     // release local memory
  delete [] col;
  delete []indx;
}  // End: function inverse()

How to use the Library functions

For Fortran users:

PROGRAM matrix
  USE constants
  USE F90library
  IMPLICIT NONE
  !      The definition of the matrix, using dynamic allocation
  REAL(DP), ALLOCATABLE, DIMENSION(:,:) :: a, ainv, unity
  !      the determinant
  REAL(DP) :: d
  !      The size of the matrix
  INTEGER :: n
  ....
  !      Allocate now place in heap for a
  ALLOCATE ( a(n,n), ainv(n,n), unity(n,n) )

How to use the Library functions

For Fortran users:

  WRITE(6,*) ' The matrix before inversion'
  WRITE(6,'(3F12.6)') a
  ainv=a
  CALL matinv (ainv, n, d)
  ....
  !      get the unity matrix
  unity=MATMUL(ainv,a)
  WRITE(6,*) ' The unity matrix'
  WRITE(6,'(3F12.6)') unity
  !      deallocate all arrays
  DEALLOCATE (a, ainv, unity)
END PROGRAM matrix

Week 36

Overview of week 36

Linear Algebra and project 1.

Discussion of Project 1
Object orientation and unit testing (see detailed instruction on webpage)

Object orientation

Why object orientation?

Three main topics: objects, class hierarchies and polymorphism
The aim here is to be to be able to write a more general code which can easily be tailored to new situations.
{\bf Polymorphism} is a term used in software development to describe a variety of techniques employed by programmers to create flexible and reusable software components. The term is Greek and it loosely translates to "many forms". Strategy: try to single out the variables needed to describe a given system and those needed to describe a given solver. !split

Object orientation

In programming languages, a polymorphic object is an entity, such as a variable or a procedure, that can hold or operate on values of differing types during the program's execution. Because a polymorphic object can operate on a variety of values and types, it can also be used in a variety of programs, sometimes with little or no change by the programmer. The idea of write once, run many, also known as code reusability, is an important characteristic to the programming paradigm known as Object-Oriented Programming (OOP).

OOP describes an approach to programming where a program is viewed as a collection of interacting, but mostly independent software components. These software components are known as objects in OOP and they are typically implemented in a programming language as an entity that encapsulates both data and procedures.

Programming classes

In Fortran a vector or matrix start with $ 1 $, but it is easy to change a vector so that it starts with zero or even a negative number. If we have a double precision Fortran vector which starts at $ -10 $ and ends at $ 10 $, we could declare it as REAL(KIND=8) :: vector(-10:10). Similarly, if we want to start at zero and end at 10 we could write REAL(KIND=8) :: vector(0:10). We have also seen that Fortran allows us to write a matrix addition $ {\bf A} = {\bf B}+{\bf C} $ as A = B + C. This means that we have overloaded the addition operator so that it translates this operation into two loops and an addition of two matrix elements $ a_{ij} = b_{ij}+c_{ij} $.

Programming classes

The way the matrix addition is written is very close to the way we express this relation mathematically. The benefit for the programmer is that our code is easier to read. Furthermore, such a way of coding makes it more likely to spot eventual errors as well.

In Ansi C and C++ arrays start by default from $ i=0 $. Moreover, if we wish to add two matrices we need to explicitely write out the two loops as

   for(i=0 ; i < n ; i++) {
      for(j=0 ; j < n ; j++) {
         a[i][j]=b[i][j]+c[i][j]

Programming classes

However, the strength of C++ is the possibility to define new data types, tailored to some particular problem. Via new data types and overloading of operations such as addition and subtraction, we can easily define sets of operations and data types which allow us to write a matrix addition in exactly the same way as we would do in Fortran. We could also change the way we declare a C++ matrix elements $ a_{ij} $, from $ a[i][j] $ to say $ a(i,j) $, as we would do in Fortran. Similarly, we could also change the default range from $ 0:n-1 $ to $ 1:n $.

To achieve this we need to introduce two important entities in C++ programming, classes and templates.

Programming classes

The function and class declarations are fundamental concepts within C++. Functions are abstractions which encapsulate an algorithm or parts of it and perform specific tasks in a program. We have already met several examples on how to use functions. Classes can be defined as abstractions which encapsulate data and operations on these data. The data can be very complex data structures and the class can contain particular functions which operate on these data. Classes allow therefore for a higher level of abstraction in computing. The elements (or components) of the data type are the class data members, and the procedures are the class member functions.

Programming classes

Classes are user-defined tools used to create multi-purpose software which can be reused by other classes or functions. These user-defined data types contain data (variables) and functions operating on the data.

A simple example is that of a point in two dimensions. The data could be the $ x $ and $ y $ coordinates of a given point. The functions we define could be simple read and write functions or the possibility to compute the distance between two points.

Programming classes

C++ has a class complex in its standard template library (STL). The standard usage in a given function could then look like

// Program to calculate addition and multiplication of two complex numbers
using namespace std;
#include <iostream>
#include <cmath>
#include <complex>
int main()
{
  complex<double> x(6.1,8.2), y(0.5,1.3);
  // write out x+y
  cout << x + y << x*y  << endl;
  return 0;

where we add and multiply two complex numbers $ x=6.1+\imath 8.2 $ and $ y=0.5+\imath 1.3 $ with the obvious results $ z=x+y=6.6+\imath 9.5 $ and $ z=x\cdot y= -7.61+\imath 12.03 $.

Programming classes

We proceed by splitting our task in three files.

We define first a header file complex.h which contains the declarations of the class. The header file contains the class declaration (data and functions), declaration of stand-alone functions, and all inlined functions, starting as follows

#ifndef Complex_H
#define Complex_H
//   various include statements and definitions
#include <iostream>          // Standard ANSI-C++ include files
#include <new>
#include ....

class Complex
{...
definition of variables and their character
};
//   declarations of various functions used by the class
...
#endif

Programming classes

Next we provide a file complex.cpp where the code and algorithms of different functions (except inlined functions) declared within the class are written. The files complex.h and complex.cpp are normally placed in a directory with other classes and libraries we have defined.

Finally, we discuss here an example of a main program which uses this particular class. An example of a program which uses our complex class is given below. In particular we would like our class to perform tasks like declaring complex variables, writing out the real and imaginary part and performing algebraic operations such as adding or multiplying two complex numbers.

Programming classes

#include "Complex.h"
...  other include and declarations
int main ()
{
  Complex a(0.1,1.3);    // we declare a complex variable a
  Complex b(3.0), c(5.0,-2.3);  // we declare  complex variables b and c
  Complex d = b;         //  we declare  a new complex variable d
  cout << "d=" << d << ", a=" << a << ", b=" << b << endl;
  d = a*c + b/a;  //   we add, multiply and divide two complex numbers
  cout << "Re(d)=" << d.Re() << ", Im(d)=" << d.Im() << endl;  // write out of the real and imaginary parts

Programming classes

We include the header file complex.h and define four different complex variables. These are $ a=0.1+\imath 1.3 $, $ b=3.0+\imath 0 $ (note that if you don't define a value for the imaginary part this is set to zero), $ c=5.0-\imath 2.3 $ and $ d=b $. Thereafter we have defined standard algebraic operations and the member functions of the class which allows us to print out the real and imaginary part of a given variable.

Programming classes

class Complex
{
private:
   double re, im; // real and imaginary part
public:
   Complex ();                              // Complex c;
   Complex (double re, double im = 0.0); // Definition of a complex variable;
   Complex (const Complex& c);              // Usage: Complex c(a);   // equate two complex variables
   Complex& operator= (const Complex& c); // c = a;   //  equate two complex variables, same as previous
....

Programming classes

  ~Complex () {}                        // destructor
   double   Re () const;        // double real_part = a.Re();
   double   Im () const;        // double imag_part = a.Im();
   double   abs () const;       // double m = a.abs(); // modulus
   friend Complex operator+ (const Complex&  a, const Complex& b);
   friend Complex operator- (const Complex&  a, const Complex& b);
   friend Complex operator* (const Complex&  a, const Complex& b);
   friend Complex operator/ (const Complex&  a, const Complex& b);
};

Programming classes

The class is defined via the statement class Complex. We must first use the key word class, which in turn is followed by the user-defined variable name Complex. The body of the class, data and functions, is encapsulated within the parentheses {...}.

Programming classes

Data and specific functions can be private, which means that they cannot be accessed from outside the class. This means also that access cannot be inherited by other functions outside the class. If we use protected instead of private, then data and functions can be inherited outside the class.

Programming classes

The key word public means that data and functions can be accessed from outside the class. Here we have defined several functions which can be accessed by functions outside the class. The declaration friend means that stand-alone functions can work on privately declared variables of the type (re, im). Data members of a class should be declared as private variables.

Programming classes

The first public function we encounter is a so-called constructor, which tells how we declare a variable of type Complex and how this variable is initialized. We have chose three possibilities in the example above:

A declaration like Complex c; calls the member function Complex() which can have the following implementation

Complex:: Complex () { re = im = 0.0; }

meaning that it sets the real and imaginary parts to zero. Note the way a member function is defined. The constructor is the first function that is called when an object is instantiated.

Programming classes

Another possibility is

Complex:: Complex () {}

which means that there is no initialization of the real and imaginary parts. The drawback is that a given compiler can then assign random values to a given variable.

A call like Complex a(0.1,1.3); means that we could call the member function `Complex(double, double)`as

Complex:: Complex (double re_a, double im_a) {
    re = re_a; im = im_a; }

Programming classes

The simplest member function are those we defined to extract the real and imaginary part of a variable. Here you have to recall that these are private data, that is they invisible for users of the class. We obtain a copy of these variables by defining the functions

double Complex:: Re () const { return re; }} //  getting the real part
double Complex:: Im () const { return im; }  //   and the imaginary part

Note that we have introduced the declaration const. What does it mean? This declaration means that a variabale cannot be changed within a called function.

Programming classes

If we define a variable as const double p = 3; and then try to change its value, we will get an error when we compile our program. This means that constant arguments in functions cannot be changed.

// const arguments (in functions) cannot be changed:
void myfunc (const Complex& c)
{ c.re = 0.2; /* ILLEGAL!! compiler error... */  }

If we declare the function and try to change the value to $ 0.2 $, the compiler will complain by sending an error message.

Programming classes

If we define a function to compute the absolute value of complex variable like

double Complex:: abs ()  { return sqrt(re*re + im*im);}

without the constant declaration and define thereafter a function myabs as

double myabs (const Complex& c)
{ return c.abs(); }   // Not ok because c.abs() is not a const func.

the compiler would not allow the c.abs() call in myabs since Complex::abs is not a constant member function.

Programming classes

Constant functions cannot change the object's state. To avoid this we declare the function abs as

double Complex:: abs () const { return sqrt(re*re + im*im); }

Programming classes

C++ (and Fortran) allow for overloading of operators. That means we can define algebraic operations on for example vectors or any arbitrary object. As an example, a vector addition of the type $ {\bf c} = {\bf a} + {\bf b} $ means that we need to write a small part of code with a for-loop over the dimension of the array. We would rather like to write this statement as c = a+b; as this makes the code much more readable and close to eventual equations we want to code. To achieve this we need to extend the definition of operators.

Programming classes

Let us study the declarations in our complex class. In our main function we have a statement like d = b;, which means that we call d.operator= (b) and we have defined a so-called assignment operator as a part of the class defined as

Complex& Complex:: operator= (const Complex& c)
{
   re = c.re;
   im = c.im;
   return *this;
}

Programming classes

With this function, statements like Complex d = b; or Complex d(b); make a new object $ d $, which becomes a copy of $ b $. We can make simple implementations in terms of the assignment

Complex:: Complex (const Complex& c)
{ *this = c; }

which is a pointer to "this object", *this is the present object, so *this = c; means setting the present object equal to $ c $, that is this->operator= (c);.

Programming classes

The meaning of the addition operator $ + $ for Complex objects is defined in the function Complex operator+ (const Complex& a, const Complex& b); // a+b The compiler translates c = a + b; into c = operator+ (a, b);. Since this implies the call to function, it brings in an additional overhead. If speed is crucial and this function call is performed inside a loop, then it is more difficult for a given compiler to perform optimizations of a loop.

Programming classes

The solution to this is to inline functions. We discussed inlining in chapter 2 of the lecture notes. Inlining means that the function body is copied directly into the calling code, thus avoiding calling the function. Inlining is enabled by the inline keyword

inline Complex operator+ (const Complex& a, const Complex& b)
{ return Complex (a.re + b.re, a.im + b.im); }

Inline functions, with complete bodies must be written in the header file complex.h.

Programming classes

Consider the case c = a + b; that is, c.operator= (operator+ (a,b)); If operator+, operator= and the constructor Complex(r,i) all are inline functions, this transforms to

c.re = a.re + b.re;
c.im = a.im + b.im;

by the compiler, i.e., no function calls

Programming classes

The stand-alone function operator+ is a friend of the Complex class

class Complex
{
   ...
   friend Complex operator+ (const Complex& a, const Complex& b);
   ...
};

so it can read (and manipulate) the private data parts $ re $ and $ im $ via

inline Complex operator+ (const Complex& a, const Complex& b)
{ return Complex (a.re + b.re, a.im + b.im); }

Programming classes

Since we do not need to alter the re and im variables, we can get the values by Re() and Im(), and there is no need to be a friend function

inline Complex operator+ (const Complex& a, const Complex& b)
{ return Complex (a.Re() + b.Re(), a.Im() + b.Im()); }

Programming classes

The multiplication functionality can now be extended to imaginary numbers by the following code

inline Complex operator* (const Complex& a, const Complex& b)
{
  return Complex(a.re*b.re - a.im*b.im, a.im*b.re + a.re*b.im);

It will be convenient to inline all functions used by this operator.

Programming classes

To inline the complete expression a*b;, the constructors and operator= must also be inlined. This can be achieved via the following piece of code

inline Complex:: Complex () { re = im = 0.0; }
inline Complex:: Complex (double re_, double im_)
{ ... }
inline Complex:: Complex (const Complex& c)
{ ... }
inline Complex:: operator= (const Complex& c)
{ ... }

Programming classes

// e, c, d are complex
e = c*d;
// first compiler translation:
e.operator= (operator* (c,d));
// result of nested inline functions
// operator=, operator*, Complex(double,double=0):
e.re = c.re*d.re - c.im*d.im;
e.im = c.im*d.re + c.re*d.im;

The definitions operator- and operator/ follow the same set up.

Programming classes

Finally, if we wish to write to file or another device a complex number using the simple syntax cout << c;, we obtain this by defining the effect of $ < < $ for a Complex object as

ostream& operator<< (ostream& o, const Complex& c)
{ o << "(" << c.Re() << "," << c.Im() << ") "; return o;}

Programming classes, templates

What if we wanted to make a class which takes integers or floating point numbers with single precision? A simple way to achieve this is copy and paste our class and replace double with for example int.

C++ allows us to do this automatically via the usage of templates, which are the C++ constructs for parameterizing parts of classes. Class templates is a template for producing classes. The declaration consists of the keyword template followed by a list of template arguments enclosed in brackets.

Programming classes

We can therefore make a more general class by rewriting our original example as

template<class T>
class Complex
{
private:
   T re, im; // real and imaginary part
public:
   Complex ();                              // Complex c;
   Complex (T re, T im = 0); // Definition of a complex variable;
   Complex (const Complex& c);              // Usage: Complex c(a);   // equate two complex variables
   Complex& operator= (const Complex& c); // c = a;   //  equate two complex variables, same as previous

Programming classes

We can therefore make a more general class by rewriting our original example as

  ~Complex () {}                        // destructor
   T   Re () const;        // T real_part = a.Re();
   T   Im () const;        // T imag_part = a.Im();
   T   abs () const;       // T m = a.abs(); // modulus
   friend Complex operator+ (const Complex&  a, const Complex& b);
   friend Complex operator- (const Complex&  a, const Complex& b);
   friend Complex operator* (const Complex&  a, const Complex& b);
   friend Complex operator/ (const Complex&  a, const Complex& b);
};

Programming classes

What it says is that Complex is a parameterized type with $ T $ as a parameter and $ T $ has to be a type such as double or float. The class complex is now a class template and we would define variables in a code as

Complex<double> a(10.0,5.1);
Complex<int> b(1,0);

Programming classes

Member functions of our class are defined by preceding the name of the function with the template keyword. Consider the function we defined as Complex:: Complex (double re_a, double im_a). We would rewrite this function as

template<class T>
Complex<T>:: Complex (T re_a, T im_a)
{ re = re_a; im = im_a; }

The member functions are otherwise defined following ordinary member function definitions.

Programming classes

Here follows a very simple first class in the file squared.h

// Not all declarations here
// Class to compute the square of a number
template<class T>
class Squared{
  public:
    // Default constructor, not used here
    Squared(){}

    // Overload the function operator()
    T operator()(T x){return x*x;}

};

Programming classes

and we would use it as

#include <iostream>
#include "squared.h"
using namespace std;

int main(){
  Squared<double> s;
  cout << s(3) << endl;

Week 37

Overview of week 37

Eigenvalue problems and project 2.

Discussion of Jacobi's algorithm, chapter 7.1-7.2
Presentation of project 2.

Reading assignment this week: chapters 7.1-7.4 of lecture notes

Eigenvalue problems, basic definitions

Let us consider the matrix $ {\bf A} $ of dimension $ n $. The eigenvalues of $ {\bf A} $ are defined through the matrix equation $$ {\bf A}{\bf x}^{(\nu)} = \lambda^{(\nu)}{\bf x}^{(\nu)}, $$ where $ \lambda^{(\nu)} $ are the eigenvalues and $ {\bf x}^{(\nu)} $ the corresponding eigenvectors. Unless otherwise stated, when we use the wording eigenvector we mean the right eigenvector. The left eigenvalue problem is defined as $$ {\bf x}^{(\nu)}_L{\bf A} = \lambda^{(\nu)}{\bf x}^{(\nu)}_L $$ The above right eigenvector problem is equivalent to a set of $ n $ equations with $ n $ unknowns $ x_i $.

Eigenvalue problems, basic definitions

The eigenvalue problem can be rewritten as $$ \left( {\bf A}-\lambda^{(\nu)} {\bf I} \right) {\bf x}^{(\nu)} = 0, $$ with $ {\bf I} $ being the unity matrix. This equation provides a solution to the problem if and only if the determinant is zero, namely $$ \left| {\bf A}-\lambda^{(\nu)}{\bf I}\right| = 0, $$ which in turn means that the determinant is a polynomial of degree $ n $ in $ \lambda $ and in general we will have $ n $ distinct zeros.

Eigenvalue problems, basic definitions

The eigenvalues of a matrix $ {\bf A}\in {\mathbb{C}}^{n\times n} $ are thus the $ n $ roots of its characteristic polynomial $$ P(\lambda) = det(\lambda{\bf I}-{\bf A}), $$ or $$ P(\lambda)= \prod_{i=1}^{n}\left(\lambda_i-\lambda\right). $$ The set of these roots is called the spectrum and is denoted as $ \lambda({\bf A}) $. If $ \lambda({\bf A})=\left\{\lambda_1,\lambda_2,\dots ,\lambda_n\right\} $ then we have $$ det({\bf A})= \lambda_1\lambda_2\dots\lambda_n, $$ and if we define the trace of $ {\bf A} $ as $$ Tr({\bf A})=\sum_{i=1}^n a_{ii}$$ then $$ Tr({\bf A})=\lambda_1+\lambda_2+\dots+\lambda_n. $$

Abel-Ruffini Impossibility Theorem

The Abel-Ruffini theorem (also known as Abel's impossibility theorem) states that there is no general solution in radicals to polynomial equations of degree five or higher.

The content of this theorem is frequently misunderstood. It does not assert that higher-degree polynomial equations are unsolvable. In fact, if the polynomial has real or complex coefficients, and we allow complex solutions, then every polynomial equation has solutions; this is the fundamental theorem of algebra. Although these solutions cannot always be computed exactly with radicals, they can be computed to any desired degree of accuracy using numerical methods such as the Newton-Raphson method or Laguerre method, and in this way they are no different from solutions to polynomial equations of the second, third, or fourth degrees.

The theorem only concerns the form that such a solution must take. The content of the theorem is that the solution of a higher-degree equation cannot in all cases be expressed in terms of the polynomial coefficients with a finite number of operations of addition, subtraction, multiplication, division and root extraction. Some polynomials of arbitrary degree, of which the simplest nontrivial example is the monomial equation $ ax^n = b $, are always solvable with a radical.

Abel-Ruffini Impossibility Theorem

The Abel-Ruffini theorem says that there are some fifth-degree equations whose solution cannot be so expressed. The equation $ x^5 - x + 1 = 0 $ is an example. Some other fifth degree equations can be solved by radicals, for example $ x^5 - x^4 - x + 1 = 0 $. The precise criterion that distinguishes between those equations that can be solved by radicals and those that cannot was given by Galois and is now part of Galois theory: a polynomial equation can be solved by radicals if and only if its Galois group is a solvable group.

Today, in the modern algebraic context, we say that second, third and fourth degree polynomial equations can always be solved by radicals because the symmetric groups $ S_2, S_3 $ and $ S_4 $ are solvable groups, whereas $ S_n $ is not solvable for $ n \ge 5 $.

Eigenvalue problems, basic definitions

In the present discussion we assume that our matrix is real and symmetric, that is $ {\bf A}\in {\mathbb{R}}^{n\times n} $. The matrix $ {\bf A} $ has $ n $ eigenvalues $ \lambda_1\dots \lambda_n $ (distinct or not). Let $ {\bf D} $ be the diagonal matrix with the eigenvalues on the diagonal $$ {\bf D}= \left( \begin{array}{ccccccc} \lambda_1 & 0 & 0 & 0 & \dots &0 & 0 \\ 0 & \lambda_2 & 0 & 0 & \dots &0 &0 \\ 0 & 0 & \lambda_3 & 0 &0 &\dots & 0\\ \dots & \dots & \dots & \dots &\dots &\dots & \dots\\ 0 & \dots & \dots & \dots &\dots &\lambda_{n-1} & \\ 0 & \dots & \dots & \dots &\dots &0 & \lambda_n \end{array} \right). $$ If $ {\bf A} $ is real and symmetric then there exists a real orthogonal matrix $ {\bf S} $ such that $$ {\bf S}^T {\bf A}{\bf S}= \mathrm{diag}(\lambda_1,\lambda_2,\dots ,\lambda_n), $$ and for $ j=1:n $ we have $ {\bf A}{\bf S}(:,j) = \lambda_j {\bf S}(:,j) $.

Eigenvalue problems, basic definitions

To obtain the eigenvalues of $ {\bf A}\in {\mathbb{R}}^{n\times n} $, the strategy is to perform a series of similarity transformations on the original matrix $ {\bf A} $, in order to reduce it either into a diagonal form as above or into a tridiagonal form.

We say that a matrix $ {\bf B} $ is a similarity transform of $ {\bf A} $ if $$ {\bf B}= {\bf S}^T {\bf A}{\bf S}, \hspace{1cm} \mathrm{where} \hspace{1cm} {\bf S}^T{\bf S}={\bf S}^{-1}{\bf S} ={\bf I}. $$ The importance of a similarity transformation lies in the fact that the resulting matrix has the same eigenvalues, but the eigenvectors are in general different.

Eigenvalue problems, basic definitions

To prove this we start with the eigenvalue problem and a similarity transformed matrix $ {\bf B} $. $$ {\bf A}{\bf x}=\lambda{\bf x} \hspace{1cm} \mathrm{and}\hspace{1cm} {\bf B}= {\bf S}^T {\bf A}{\bf S}. $$ We multiply the first equation on the left by $ {\bf S}^T $ and insert $ {\bf S}^{T}{\bf S} = {\bf I} $ between $ {\bf A} $ and $ {\bf x} $. Then we get $$ \begin{equation} ({\bf S}^T{\bf A}{\bf S})({\bf S}^T{\bf x})=\lambda{\bf S}^T{\bf x} , \end{equation} $$ which is the same as $$ {\bf B} \left ( {\bf S}^T{\bf x} \right ) = \lambda \left ({\bf S}^T{\bf x}\right ). $$ The variable $ \lambda $ is an eigenvalue of $ {\bf B} $ as well, but with eigenvector $ {\bf S}^T{\bf x} $.

Eigenvalue problems, basic definitions

The basic philosophy is to

Either apply subsequent similarity transformations (direct method) so that

$$ \begin{equation} {\bf S}_N^T\dots {\bf S}_1^T{\bf A}{\bf S}_1\dots {\bf S}_N={\bf D} , \end{equation} $$

Or apply subsequent similarity transformations so that $ {\bf A} $ becomes tridiagonal (Householder) or upper/lower triangular (the QR method to be discussed later).
Thereafter, techniques for obtaining eigenvalues from tridiagonal matrices can be used.
Or use so-called power methods
Or use iterative methods (Krylov, Lanczos, Arnoldi). These methods are popular for huge matrix problems.

Discussion of project 2

In project 1 we rewrote our original differential equation in terms of a discretized equation with approximations to the derivatives as $$ -\frac{u_{i+1} -2u_i +u_{i-i}}{h^2}=f(x_i,u(x_i)), $$ with $ i=1,2,\dots, n $. We need to add to this system the two boundary conditions $ u(a) =u_0 $ and $ u(b) = u_{n+1} $. If we define a matrix $$ {\bf A} = \frac{1}{h^2}\left(\begin{array}{cccccc} 2 & -1 & & & & \\ -1 & 2 & -1 & & & \\ & -1 & 2 & -1 & & \\ & \dots & \dots &\dots &\dots & \dots \\ & & &-1 &2& -1 \\ & & & &-1 & 2 \\ \end{array} \right) $$ and the corresponding vectors $ {\bf u} = (u_1, u_2, \dots,u_n)^T $ and $ {\bf f}({\bf u}) = f(x_1,x_2,\dots, x_n,u_1, u_2, \dots,u_n)^T $ we can rewrite the differential equation including the boundary conditions as a system of linear equations with a large number of unknowns $$ {\bf A}{\bf u} = {\bf f}({\bf u}). $$

Discussion of project 2

We are first interested in the solution of the radial part of Schroedinger's equation for one electron. This equation reads $$ -\frac{\hbar^2}{2 m} \left ( \frac{1}{r^2} \frac{d}{dr} r^2 \frac{d}{dr} - \frac{l (l + 1)}{r^2} \right )R(r) + V(r) R(r) = E R(r). $$ In our case $ V(r) $ is the harmonic oscillator potential $ (1/2)kr^2 $ with $ k=m\omega^2 $ and $ E $ is the energy of the harmonic oscillator in three dimensions. The oscillator frequency is $ \omega $ and the energies are $$ E_{nl}= \hbar \omega \left(2n+l+\frac{3}{2}\right), $$ with $ n=0,1,2,\dots $ and $ l=0,1,2,\dots $.

Discussion of project 2

Since we have made a transformation to spherical coordinates it means that $ r\in [0,\infty) $. The quantum number $ l $ is the orbital momentum of the electron. Then we substitute $ R(r) = (1/r) u(r) $ and obtain $$ -\frac{\hbar^2}{2 m} \frac{d^2}{dr^2} u(r) + \left ( V(r) + \frac{l (l + 1)}{r^2}\frac{\hbar^2}{2 m} \right ) u(r) = E u(r) . $$ The boundary conditions are $ u(0)=0 $ and $ u(\infty)=0 $.

Discussion of project 2

We introduce a dimensionless variable $ \rho = (1/\alpha) r $ where $ \alpha $ is a constant with dimension length and get $$ -\frac{\hbar^2}{2 m \alpha^2} \frac{d^2}{d\rho^2} u(\rho) + \left ( V(\rho) + \frac{l (l + 1)}{\rho^2} \frac{\hbar^2}{2 m\alpha^2} \right ) u(\rho) = E u(\rho) . $$ In project 2 we choose $ l=0 $. Inserting $ V(\rho) = (1/2) k \alpha^2\rho^2 $ we end up with $$ -\frac{\hbar^2}{2 m \alpha^2} \frac{d^2}{d\rho^2} u(\rho) + \frac{k}{2} \alpha^2\rho^2u(\rho) = E u(\rho) . $$ We multiply thereafter with $ 2m\alpha^2/\hbar^2 $ on both sides and obtain $$ -\frac{d^2}{d\rho^2} u(\rho) + \frac{mk}{\hbar^2} \alpha^4\rho^2u(\rho) = \frac{2m\alpha^2}{\hbar^2}E u(\rho) . $$

Discussion of project 2

We have thus $$ -\frac{d^2}{d\rho^2} u(\rho) + \frac{mk}{\hbar^2} \alpha^4\rho^2u(\rho) = \frac{2m\alpha^2}{\hbar^2}E u(\rho) . $$ The constant $ \alpha $ can now be fixed so that $$ \frac{mk}{\hbar^2} \alpha^4 = 1, $$ or $$ \alpha = \left(\frac{\hbar^2}{mk}\right)^{1/4}. $$ Defining $$ \lambda = \frac{2m\alpha^2}{\hbar^2}E, $$ we can rewrite Schr\"odinger's equation as $$ -\frac{d^2}{d\rho^2} u(\rho) + \rho^2u(\rho) = \lambda u(\rho) . $$ This is the first equation to solve numerically. In three dimensions the eigenvalues for $ l=0 $ are $ \lambda_0=3,\lambda_1=7,\lambda_2=11,\dots . $

Discussion of project 2

We use the by now standard expression for the second derivative of a function $ u $ $$ \begin{equation} u''=\frac{u(\rho+h) -2u(\rho) +u(\rho-h)}{h^2} +O(h^2), \label{eq:diffoperation} \end{equation} $$ where $ h $ is our step. Next we define minimum and maximum values for the variable $ \rho $, $ \rho_{\mathrm{min}}=0 $ and $ \rho_{\mathrm{max}} $, respectively. You need to check your results for the energies against different values $ \rho_{\mathrm{max}} $, since we cannot set $ \rho_{\mathrm{max}}=\infty $.

Discussion of project 2

With a given number of steps, $ n_{\mathrm{step}} $, we then define the step $ h $ as $$ h=\frac{\rho_{\mathrm{max}}-\rho_{\mathrm{min}} }{n_{\mathrm{step}}}. $$ Define an arbitrary value of $ \rho $ as $$ \rho_i= \rho_{\mathrm{min}} + ih \hspace{1cm} i=0,1,2,\dots , n_{\mathrm{step}} $$ we can rewrite the Schr\"odinger equation for $ \rho_i $ as $$ -\frac{u(\rho_i+h) -2u(\rho_i) +u(\rho_i-h)}{h^2}+\rho_i^2u(\rho_i) = \lambda u(\rho_i), $$ or in a more compact way $$ -\frac{u_{i+1} -2u_i +u_{i-1}}{h^2}+\rho_i^2u_i=-\frac{u_{i+1} -2u_i +u_{i-1} }{h^2}+V_iu_i = \lambda u_i, $$ where $ V_i=\rho_i^2 $ is the harmonic oscillator potential.

Discussion of project 2

Define first the diagonal matrix element $$ d_i=\frac{2}{h^2}+V_i, $$ and the non-diagonal matrix element $$ e_i=-\frac{1}{h^2}. $$ In this case the non-diagonal matrix elements are given by a mere constant. All non-diagonal matrix elements are equal.

With these definitions the Schroedinger equation takes the following form $$ d_iu_i+e_{i-1}u_{i-1}+e_{i+1}u_{i+1} = \lambda u_i, $$ where $ u_i $ is unknown. We can write the latter equation as a matrix eigenvalue problem $$ \begin{equation} \left( \begin{array}{ccccccc} d_1 & e_1 & 0 & 0 & \dots &0 & 0 \\ e_1 & d_2 & e_2 & 0 & \dots &0 &0 \\ 0 & e_2 & d_3 & e_3 &0 &\dots & 0\\ \dots & \dots & \dots & \dots &\dots &\dots & \dots\\ 0 & \dots & \dots & \dots &\dots &d_{n_{\mathrm{step}}-2} & e_{n_{\mathrm{step}}-1}\\ 0 & \dots & \dots & \dots &\dots &e_{n_{\mathrm{step}}-1} & d_{n_{\mathrm{step}}-1} \end{array} \right) \left( \begin{array}{c} u_{1} \\ u_{2} \\ \dots\\ \dots\\ \dots\\ u_{n_{\mathrm{step}}-1} \end{array} \right)=\lambda \left( \begin{array}{c} u_{1} \\ u_{2} \\ \dots\\ \dots\\ \dots\\ u_{n_{\mathrm{step}}-1} \end{array} \right) \label{eq:sematrix} \end{equation} $$ or if we wish to be more detailed, we can write the tridiagonal matrix as $$ \begin{equation} \left( \begin{array}{ccccccc} \frac{2}{h^2}+V_1 & -\frac{1}{h^2} & 0 & 0 & \dots &0 & 0 \\ -\frac{1}{h^2} & \frac{2}{h^2}+V_2 & -\frac{1}{h^2} & 0 & \dots &0 &0 \\ 0 & -\frac{1}{h^2} & \frac{2}{h^2}+V_3 & -\frac{1}{h^2} &0 &\dots & 0\\ \dots & \dots & \dots & \dots &\dots &\dots & \dots\\ 0 & \dots & \dots & \dots &\dots &\frac{2}{h^2}+V_{n_{\mathrm{step}}-2} & -\frac{1}{h^2}\\ 0 & \dots & \dots & \dots &\dots &-\frac{1}{h^2} & \frac{2}{h^2}+V_{n_{\mathrm{step}}-1} \end{array} \right) \label{eq:matrixse} \end{equation} $$ Recall that the solutions are known via the boundary conditions at $ i=n_{\mathrm{step}} $ and at the other end point, that is for $ \rho_0 $. The solution is zero in both cases.

Discussion of project 2

We are going to study two electrons in a harmonic oscillator well which also interact via a repulsive Coulomb interaction. Let us start with the single-electron equation written as $$ -\frac{\hbar^2}{2 m} \frac{d^2}{dr^2} u(r) + \frac{1}{2}k r^2u(r) = E^{(1)} u(r), $$ where $ E^{(1)} $ stands for the energy with one electron only. For two electrons with no repulsive Coulomb interaction, we have the following Schroedinger equation $$ \left( -\frac{\hbar^2}{2 m} \frac{d^2}{dr_1^2} -\frac{\hbar^2}{2 m} \frac{d^2}{dr_2^2}+ \frac{1}{2}k r_1^2+ \frac{1}{2}k r_2^2\right)u(r_1,r_2) = E^{(2)} u(r_1,r_2) . $$

Discussion of project 2

Note that we deal with a two-electron wave function $ u(r_1,r_2) $ and two-electron energy $ E^{(2)} $.

With no interaction this can be written out as the product of two single-electron wave functions, that is we have a solution on closed form.

We introduce the relative coordinate $ {\bf r} = {\bf r}_1-{\bf r}_2 $ and the center-of-mass coordinate $ {\bf R} = 1/2({\bf r}_1+{\bf r}_2) $. With these new coordinates, the radial Schr\"odinger equation reads $$ \left( -\frac{\hbar^2}{m} \frac{d^2}{dr^2} -\frac{\hbar^2}{4 m} \frac{d^2}{dR^2}+ \frac{1}{4} k r^2+ kR^2\right)u(r,R) = E^{(2)} u(r,R). $$

Discussion of project 2

The equations for $ r $ and $ R $ can be separated via the ansatz for the wave function $ u(r,R) = \psi(r)\phi(R) $ and the energy is given by the sum of the relative energy $ E_r $ and the center-of-mass energy $ E_R $, that is $$ E^{(2)}=E_r+E_R. $$ We add then the repulsive Coulomb interaction between two electrons, namely a term $$ V(r_1,r_2) = \frac{\beta e^2}{|{\bf r}_1-{\bf r}_2|}=\frac{\beta e^2}{r}, $$ with $ \beta e^2=1.44 $ eVnm.

Discussion of project 2

Adding this term, the $ r $-dependent Schroedinger equation becomes $$ \left( -\frac{\hbar^2}{m} \frac{d^2}{dr^2}+ \frac{1}{4}k r^2+\frac{\beta e^2}{r}\right)\psi(r) = E_r \psi(r). $$ This equation is similar to the one we had previously in parts (a) and (b) and we introduce again a dimensionless variable $ \rho = r/\alpha $. Repeating the same steps, we arrive at $$ -\frac{d^2}{d\rho^2} \psi(\rho) + \frac{mk}{4\hbar^2} \alpha^4\rho^2\psi(\rho)+\frac{m\alpha \beta e^2}{\rho\hbar^2}\psi(\rho) = \frac{m\alpha^2}{\hbar^2}E_r \psi(\rho) . $$

Discussion of project 2

We want to manipulate this equation further to make it as similar to that in (a) as possible. We define a 'frequency' $$ \omega_r^2=\frac{1}{4}\frac{mk}{\hbar^2} \alpha^4, $$ and fix the constant $ \alpha $ by requiring $$ \frac{m\alpha \beta e^2}{\hbar^2}=1 $$ or $$ \alpha = \frac{\hbar^2}{m\beta e^2}. $$

Discussion of project 2

Defining $$ \lambda = \frac{m\alpha^2}{\hbar^2}E, $$ we can rewrite Schroedinger's equation as $$ -\frac{d^2}{d\rho^2} \psi(\rho) + \omega_r^2\rho^2\psi(\rho) +\frac{1}{\rho}\psi(\rho) = \lambda \psi(\rho). $$

Discussion of project 2

We treat $ \omega_r $ as a parameter which reflects the strength of the oscillator potential.

Here we will study the cases $ \omega_r = 0.01 $, $ \omega_r = 0.5 $, $ \omega_r =1 $, and $ \omega_r = 5 $ for the ground state only, that is the lowest-lying state.

Discussion of project 2

With no repulsive Coulomb interaction you should get a result which corresponds to the relative energy of a non-interacting system. Make sure your results are stable as functions of $ \rho_{\mathrm{max}} $ and the number of steps.

We are only interested in the ground state with $ l=0 $. We omit the center-of-mass energy.

For specific oscillator frequencies, the above equation has analytic answers, see the article by M. Taut, Phys. Rev. A 48, 3561 - 3566 (1993). The article can be retrieved from the following web address http://prola.aps.org/abstract/PRA/v48/i5/p3561_1.

Discussion of Jacobi's method for eigenvalues

The general overview.

One speaks normally of two main approaches to solving the eigenvalue problem.

The first is the formal method, involving determinants and the characteristic polynomial. This proves how many eigenvalues there are, and is the way most of you learned about how to solve the eigenvalue problem, but for matrices of dimensions greater than 2 or 3, it is rather impractical.
The other general approach is to use similarity or unitary tranformations to reduce a matrix to diagonal form. This is normally done in two steps: first reduce to for example a tridiagonal form, and then to diagonal form. The main algorithms we will discuss in detail, Jacobi's and Householder's (so-called direct method) and Lanczos algorithms (an iterative method), follow this

methodology.

Discussion of Jacobi's method for eigenvalues

Direct or non-iterative methods require for matrices of dimensionality $ n\times n $ typically $ O(n^3) $ operations. These methods are normally called standard methods and are used for dimensionalities $ n \sim 10^5 $ or smaller. A brief historical overview

Year	$ n $
1950	$ n=20 $	(Wilkinson)
1965	$ n=200 $	(Forsythe et al.)
1980	$ n=2000 $	Linpack
1995	$ n=20000 $	Lapack
2012	$ n\sim 10^5 $	Lapack

shows that in the course of 60 years the dimension that direct diagonalization methods can handle has increased by almost a factor of $ 10^4 $. However, it pales beside the progress achieved by computer hardware, from flops to petaflops, a factor of almost $ 10^{15} $. We see clearly played out in history the $ O(n^3) $ bottleneck of direct matrix algorithms.

Sloppily speaking, when $ n\sim 10^4 $ is cubed we have $ O(10^{12}) $ operations, which is smaller than the $ 10^{15} $ increase in flops.

Discussion of Jacobi's method for eigenvalues

If the matrix to diagonalize is large and sparse, direct methods simply become impractical, also because many of the direct methods tend to destroy sparsity. As a result large dense matrices may arise during the diagonalization procedure. The idea behind iterative methods is to project the $n-$dimensional problem in smaller spaces, so-called Krylov subspaces. Given a matrix $ {\bf A} $ and a vector $ {\bf v} $, the associated Krylov sequences of vectors (and thereby subspaces) $ {\bf v} $, $ {\bf A}{\bf v} $, $ {\bf A}^2{\bf v} $, $ {\bf A}^3{\bf v},\dots $, represent successively larger Krylov subspaces.

Matrix	$ {\bf A}{\bf x}={\bf b} $	$ {\bf A}{\bf x}=\lambda{\bf x} $
$ {\bf A}={\bf A}^* $	Conjugate gradient	Lanczos
$ {\bf A}\ne {\bf A}^* $	GMRES etc	Arnoldi

Discussion of Jacobi's method for eigenvalues

LINPACK: package for linear equations and least square problems.
LAPACK:package for solving symmetric, unsymmetric and generalized eigenvalue problems. From LAPACK's website http://www.netlib.org it is possible to download for free all source codes from this library. Both C/C++ and Fortran versions are available.
BLAS (I, II and III): (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. Blas I is vector operations, II vector-matrix operations and III matrix-matrix operations.

Discussion of Jacobi's method for eigenvalues

Consider an example of an ($ n\times n $) orthogonal transformation matrix $$ {\bf S}= \left( \begin{array}{cccccccc} 1 & 0 & \dots & 0 & 0 & \dots & 0 & 0 \\ 0 & 1 & \dots & 0 & 0 & \dots & 0 & 0 \\ \dots & \dots & \dots & \dots & \dots & \dots & 0 & \dots \\ 0 & 0 & \dots & \cos\theta & 0 & \dots & 0 & \sin\theta \\ 0 & 0 & \dots & 0 & 1 & \dots & 0 & 0 \\ \dots & \dots & \dots & \dots & \dots & \dots & 1 & \dots \\ 0 & 0 & \dots & -\sin\theta & 0 & \dots & 0 & \cos\theta \end{array} \right) $$ with property $ {\bf S^{T}} = {\bf S^{-1}} $. It performs a plane rotation around an angle $ \theta $ in the Euclidean $n-$dimensional space.

Discussion of Jacobi's method for eigenvalues

It means that its matrix elements that differ from zero are given by $$ s_{kk}= s_{ll}=\cos\theta, s_{kl}=-s_{lk}= -\sin\theta, s_{ii}=1\hspace{0.5cm} i\ne k \hspace{0.5cm} i \ne l, $$ A similarity transformation $$ {\bf B}= {\bf S}^T {\bf A}{\bf S}, $$ results in $$ \begin{eqnarray*} b_{ik} &=& a_{ik}\cos\theta - a_{il}\sin\theta , i \ne k, i \ne l \\ b_{il} &=& a_{il}\cos\theta + a_{ik}\sin\theta , i \ne k, i \ne l \nonumber\\ b_{kk} &=& a_{kk}\cos^2\theta - 2a_{kl}\cos\theta \sin\theta +a_{ll}\sin^2\theta\nonumber\\ b_{ll} &=& a_{ll}\cos^2\theta +2a_{kl}\cos\theta sin\theta +a_{kk}\sin^2\theta\nonumber\\ b_{kl} &=& (a_{kk}-a_{ll})\cos\theta \sin\theta +a_{kl}(\cos^2\theta-\sin^2\theta)\nonumber \end{eqnarray*} $$ The angle $ \theta $ is arbitrary. The recipe is to choose $ \theta $ so that all non-diagonal matrix elements $ b_{kl} $ become zero.

Discussion of Jacobi's method for eigenvalues

The main idea is thus to reduce systematically the norm of the off-diagonal matrix elements of a matrix $ {\bf A} $ $$ \mathrm{off}({\bf A}) = \sqrt{\sum_{i=1}^n\sum_{j=1,j\ne i}^n a_{ij}^2}. $$ To demonstrate the algorithm, we consider the simple $ 2\times 2 $ similarity transformation of the full matrix. The matrix is symmetric, we single out $ 1\le k < l \le n$ and use the abbreviations $ c=\cos\theta $ and $ s=\sin\theta $ to obtain $$ \left( \begin{array}{cc} b_{kk} & 0 \\ 0 & b_{ll} \\\end{array} \right) = \left( \begin{array}{cc} c & -s \\ s &c \\\end{array} \right) \left( \begin{array}{cc} a_{kk} & a_{kl} \\ a_{lk} &a_{ll} \\\end{array} \right) \left( \begin{array}{cc} c & s \\ -s & c \\\end{array} \right). $$

Discussion of Jacobi's method for eigenvalues

We require that the non-diagonal matrix elements $ b_{kl}=b_{lk}=0 $, implying that $$ a_{kl}(c^2-s^2)+(a_{kk}-a_{ll})cs = b_{kl} = 0. $$ If $ a_{kl}=0 $ one sees immediately that $ \cos\theta = 1 $ and $ \sin\theta=0 $.

Discussion of Jacobi's method for eigenvalues

The Frobenius norm of an orthogonal transformation is always preserved. The Frobenius norm is defined as $$ \mathrm{norm}({\bf A})_F = \sqrt{\sum_{i=1}^n\sum_{j=1}^n |a_{ij}|^2}. $$ This means that for our $ 2\times 2 $ case we have $$ 2a_{kl}^2+a_{kk}^2+a_{ll}^2 = b_{kk}^2+b_{ll}^2, $$ which leads to $$ \mathrm{off}({\bf B})^2 = \mathrm{norm}({\bf B})_F^2-\sum_{i=1}^nb_{ii}^2=\mathrm{off}({\bf A})^2-2a_{kl}^2, $$ since $$ \mathrm{norm}({\bf B})_F^2-\sum_{i=1}^nb_{ii}^2=\mathrm{norm}({\bf A})_F^2-\sum_{i=1}^na_{ii}^2+(a_{kk}^2+a_{ll}^2 -b_{kk}^2-b_{ll}^2). $$ This results means that the matrix $ {\bf A} $ moves closer to diagonal form for each transformation.

Discussion of Jacobi's method for eigenvalues

Defining the quantities $ \tan\theta = t= s/c $ and $$\cot 2\theta=\tau = \frac{a_{ll}-a_{kk}}{2a_{kl}}, $$ we obtain the quadratic equation (using $ \cot 2\theta=1/2(\cot \theta-\tan\theta) $ $$ t^2+2\tau t-1= 0, $$ resulting in $$ t = -\tau \pm \sqrt{1+\tau^2}, $$ and $ c $ and $ s $ are easily obtained via $$ c = \frac{1}{\sqrt{1+t^2}}, $$ and $ s=tc $. Choosing $ t $ to be the smaller of the roots ensures that $ |\theta| \le \pi/4 $ and has the effect of minimizing the difference between the matrices $ {\bf B} $ and $ {\bf A} $ since $$ \mathrm{norm}({\bf B}-{\bf A})_F^2=4(1-c)\sum_{i=1,i\ne k,l}^n(a_{ik}^2+a_{il}^2) +\frac{2a_{kl}^2}{c^2}. $$

Discussion of Jacobi's method for eigenvalues

Choose a tolerance $ \epsilon $, making it a small number, typically $ 10^{-8} $ or smaller.
Setup a while test where one compares the norm of the newly computed off-diagonal

matrix elements \[ \mathrm{off}({\bf A}) = \sqrt{\sum_{i=1}^n\sum_{j=1,j\ne i}^n a_{ij}^2} > \epsilon. \]

Now choose the matrix elements $ a_{kl} $ so that we have those with largest value, that is $ |a_{kl}|=\mathrm{max}_{i\ne j} |a_{ij}| $.

Compute thereafter $ \tau = (a_{ll}-a_{kk})/2a_{kl} $, $ \tan\theta $, $ \cos\theta $ and $ \sin\theta $.

Compute thereafter the similarity transformation for this set of values $ (k,l) $, obtaining the new matrix $ {\bf B}= {\bf S}(k,l,\theta)^T {\bf A}{\bf S}(k,l,\theta) $.
Compute the new norm of the off-diagonal matrix elements and continue till you have satisfied $ \mathrm{off}({\bf B}) \le \epsilon $

Discussion of Jacobi's method for eigenvalues

The convergence rate of the Jacobi method is however poor, one needs typically $ 3n^2-5n^2 $ rotations and each rotation requires $ 4n $ operations, resulting in a total of $ 12n^3-20n^3 $ operations in order to zero out non-diagonal matrix elements.

Discussion of Jacobi's method for eigenvalues

We specialize to a symmetric $ 3\times 3 $ matrix $ {\bf A} $. We start the process as follows (assuming that $ a_{23}=a_{32} $ is the largest non-diagonal) with $ c=\cos{\theta} $ and $ s=\sin{\theta} $ $$ {\bf B} = \left( \begin{array}{ccc} 1 & 0 & 0 \\ 0 & c & -s \\ 0 & s & c \end{array} \right)\left( \begin{array}{ccc} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{array} \right) \left( \begin{array}{ccc} 1 & 0 & 0 \\ 0 & c & s \\ 0 & -s & c \end{array} \right). $$ We will choose the angle $ \theta $ in order to have $ a_{23}=a_{32}=0 $. We get (symmetric matrix) $$ {\bf B} =\left( \begin{array}{ccc} a_{11} & a_{12}c -a_{13}s& a_{12}s+a_{13}c \\ a_{12}c -a_{13}s & a_{22}c^2+a_{33}s^2 -2a_{23}sc& (a_{22}-a_{33})sc +a_{23}(c^2-s^2) \\ a_{12}s+a_{13}c & (a_{22}-a_{33})sc +a_{23}(c^2-s^2) & a_{22}s^2+a_{33}c^2 +2a_{23}sc \end{array} \right). $$ Note that $ a_{11} $ is unchanged! As it should.

Discussion of Jacobi's method for eigenvalues

We have $$ {\bf B} =\left( \begin{array}{ccc} a_{11} & a_{12}c -a_{13}s& a_{12}s+a_{13}c \\ a_{12}c -a_{13}s & a_{22}c^2+a_{33}s^2 -2a_{23}sc& (a_{22}-a_{33})sc +a_{23}(c^2-s^2) \\ a_{12}s+a_{13}c & (a_{22}-a_{33})sc +a_{23}(c^2-s^2) & a_{22}s^2+a_{33}c^2 +2a_{23}sc \end{array} \right). $$ or $$ \begin{eqnarray*} b_{11} &=& a_{11} \\ b_{12} &=& a_{12}\cos\theta - a_{13}\sin\theta , 1 \ne 2, 1 \ne 3 \\ b_{13} &=& a_{13}\cos\theta + a_{12}\sin\theta , 1 \ne 2, 1 \ne 3 \nonumber\\ b_{22} &=& a_{22}\cos^2\theta - 2a_{23}\cos\theta \sin\theta +a_{33}\sin^2\theta\nonumber\\ b_{33} &=& a_{33}\cos^2\theta +2a_{23}\cos\theta \sin\theta +a_{22}\sin^2\theta\nonumber\\ b_{23} &=& (a_{22}-a_{33})\cos\theta \sin\theta +a_{23}(\cos^2\theta-\sin^2\theta)\nonumber \end{eqnarray*} $$ We will fix the angle $ \theta $ so that $ b_{23}=0 $.

Discussion of Jacobi's method for eigenvalues

We get then a new matrix $$ {\bf B} =\left( \begin{array}{ccc} b_{11} & b_{12}& b_{13} \\ b_{12}& b_{22}& 0 \\ b_{13}& 0& a_{33} \end{array} \right). $$ We repeat then assuming that $ b_{12} $ is the largest non-diagonal matrix element and get a new matrix $$ {\bf C} = \left( \begin{array}{ccc} c & -s & 0 \\ s & c & 0 \\ 0 & 0 & 1 \end{array} \right)\left( \begin{array}{ccc} b_{11} & b_{12} & b_{13} \\ b_{12} & b_{22} & 0 \\ b_{13} & 0 & b_{33} \end{array} \right) \left( \begin{array}{ccc} c & s & 0 \\ -s & c & 0 \\ 0 & 0 & 1 \end{array} \right). $$ We continue this process till all non-diagonal matrix elements are zero (ideally). You will notice that performing the above operations that the matrix element $ b_{23} $ which was previous zero becomes different from zero. This is one of the problems which slows down the jacobi procedure.

Discussion of Jacobi's method for eigenvalues

The more general expression for the new matrix elements are $$ \begin{eqnarray*} b_{ii} &=& a_{ii}, i \ne k, i \ne l \\ b_{ik} &=& a_{ik}\cos\theta - a_{il}\sin\theta , i \ne k, i \ne l \\ b_{il} &=& a_{il}\cos\theta + a_{ik}\sin\theta , i \ne k, i \ne l \nonumber\\ b_{kk} &=& a_{kk}\cos^2\theta - 2a_{kl}\cos\theta \sin\theta +a_{ll}\sin^2\theta\nonumber\\ b_{ll} &=& a_{ll}\cos^2\theta +2a_{kl}\cos\theta \sin\theta +a_{kk}\sin^2\theta\nonumber\\ b_{kl} &=& (a_{kk}-a_{ll})\cos\theta \sin\theta +a_{kl}(\cos^2\theta-\sin^2\theta)\nonumber \end{eqnarray*} $$ This is what we will need to code.

Discussion of Jacobi's method for eigenvalues

Code example.

//  we have defined a matrix A and a matrix R for the eigenvector, both of dim n x n
//  The final matrix R has the eigenvectors in its row elements, it is set to one
//  for the diagonal elements in the beginning, zero else.
....
double tolerance = 1.0E-10; 
int iterations = 0;
while ( maxnondiag > tolerance && iterations <= maxiter)
{
   int p, q;
   maxnondiag  = offdiag(A, p, q, n);
   Jacobi_rotate(A, R, p, q, n);
   iterations++;
}
...

Discussion of Jacobi's method for eigenvalues

Finding the max nondiagonal element

//  the offdiag function, using Armadillo
double offdiag(mat A, int p, int q, int n);
{
   double max;
   for (int i = 0; i < n; ++i)
   {
       for ( int j = i+1; j < n; ++j)
       {
           double aij = fabs(A(i,j));
           if ( aij > max)
           { 
              max = aij;  p = i; q = j;
           }
       }
   }
   return max;
}
// more statements

Discussion of Jacobi's method for eigenvalues

Finding the new matrix elements

void Jacobi_rotate ( mat A, mat R, int k, int l, int n )
{
  double s, c;
  if ( A(k,l) != 0.0 ) {
    double t, tau;
    tau = (A(l,l) - A(k,k))/(2*A(k,l));
    
    if ( tau >= 0 ) {
      t = 1.0/(tau + sqrt(1.0 + tau*tau));
    } else {
      t = -1.0/(-tau +sqrt(1.0 + tau*tau));
    }
    
    c = 1/sqrt(1+t*t);
    s = c*t;
  } else {
    c = 1.0;
    s = 0.0;
  }
  double a_kk, a_ll, a_ik, a_il, r_ik, r_il;
  a_kk = A(k,k);
  a_ll = A(l,l);
  A(k,k) = c*c*a_kk - 2.0*c*s*A(k,l) + s*s*a_ll;
  A(l,l) = s*s*a_kk + 2.0*c*s*A(k,l) + c*c*a_ll;
  A(k,l) = 0.0;  // hard-coding non-diagonal elements by hand
  A(l,k) = 0.0;  // same here
  for ( int i = 0; i < n; i++ ) {
    if ( i != k && i != l ) {
      a_ik = A(i,k);
      a_il = A(i,l);
      A(i,k) = c*a_ik - s*a_il;
      A(k,i) = A(i,k);
      A(i,l) = c*a_il + s*a_ik;
      A(l,i) = A(i,l);
    }
//  And finally the new eigenvectors
    r_ik = R(i,k);
    r_il = R(i,l);

    R(i,k) = c*r_ik - s*r_il;
    R(i,l) = c*r_il + s*r_ik;
  }
  return;
} // end of function jacobi_rotate

Discussion of Householder's method for eigenvalues

The drawbacks with Jacobi's method are rather obvious, with perhaps the most negative feature being the fact that we cannot tell * a priori* how many transformations are needed. Can we do better? The answer to this is yes and is given by a clever algorithm outlined by Householder. It was ranked among the top ten algorithms in the previous century. We will discuss this algorithm in more detail during week 38.

The first step consists in finding an orthogonal matrix $ {\bf S} $ which is the product of $ (n-2) $ orthogonal matrices $$ {\bf S}={\bf S}_1{\bf S}_2\dots{\bf S}_{n-2}, $$ each of which successively transforms one row and one column of $ {\bf A} $ into the required tridiagonal form. Only $ n-2 $ transformations are required, since the last two elements are already in tridiagonal form. In order to determine each $ {\bf S_i} $ let us see what happens after the first multiplication, namely, $$ {\bf S}_1^T{\bf A}{\bf S}_1= \left( \begin{array}{ccccccc} a_{11} & e_1 & 0 & 0 & \dots &0 & 0 \\ e_1 & a'_{22} &a'_{23} & \dots & \dots &\dots &a'_{2n} \\ 0 & a'_{32} &a'_{33} & \dots & \dots &\dots &a'_{3n} \\ 0 & \dots &\dots & \dots & \dots &\dots & \\ 0 & a'_{n2} &a'_{n3} & \dots & \dots &\dots &a'_{nn} \\ \end{array} \right) $$ where the primed quantities represent a matrix $ {\bf A}' $ of dimension $ n-1 $ which will subsequentely be transformed by $ {\bf S_2} $.

Discussion of Householder's method for eigenvalues

The factor $ e_1 $ is a possibly non-vanishing element. The next transformation produced by $ {\bf S_2} $ has the same effect as $ {\bf S_1} $ but now on the submatirx $ {\bf A^{'}} $ only $$ \left ({\bf S}_{1}{\bf S}_{2} \right )^{T} {\bf A}{\bf S}_{1} {\bf S}_{2} = \left( \begin{array}{ccccccc} a_{11} & e_1 & 0 & 0 & \dots &0 & 0 \\ e_1 & a'_{22} &e_2 & 0 & \dots &\dots &0 \\ 0 & e_2 &a''_{33} & \dots & \dots &\dots &a''_{3n} \\ 0 & \dots &\dots & \dots & \dots &\dots & \\ 0 & 0 &a''_{n3} & \dots & \dots &\dots &a''_{nn} \\ \end{array} \right) $$ Note that the effective size of the matrix on which we apply the transformation reduces for every new step. In the previous Jacobi method each similarity transformation is in principle performed on the full size of the original matrix.

Discussion of Householder's method for eigenvalues

After a series of such transformations, we end with a set of diagonal matrix elements $$ a_{11}, a'_{22}, a''_{33}\dots a^{n-1}_{nn}, $$ and off-diagonal matrix elements $$ e_1, e_2,e_3, \dots, e_{n-1}. $$ The resulting matrix reads $$ {\bf S}^{T} {\bf A} {\bf S} = \left( \begin{array}{ccccccc} a_{11} & e_1 & 0 & 0 & \dots &0 & 0 \\ e_1 & a'_{22} & e_2 & 0 & \dots &0 &0 \\ 0 & e_2 & a''_{33} & e_3 &0 &\dots & 0\\ \dots & \dots & \dots & \dots &\dots &\dots & \dots\\ 0 & \dots & \dots & \dots &\dots &a^{(n-1)}_{n-2} & e_{n-1}\\ 0 & \dots & \dots & \dots &\dots &e_{n-1} & a^{(n-1)}_{n-1} \end{array} \right) . $$

Discussion of Householder's method for eigenvalues

It remains to find a recipe for determining the transformation $ {\bf S}_n $. We illustrate the method for $ {\bf S}_1 $ which we assume takes the form $$ {\bf S_{1}} = \left( \begin{array}{cc} 1 & {\bf 0^{T}} \\ {\bf 0}& {\bf P} \end{array} \right), $$ with $ {\bf 0^{T}} $ being a zero row vector, $ {\bf 0^{T}} = \{0,0,\cdots\} $ of dimension $ (n-1) $. The matrix $ {\bf P} $ is symmetric with dimension ($ (n-1) \times (n-1) $) satisfying $ {\bf P}^2={\bf I} $ and $ {\bf P}^T={\bf P} $. A possible choice which fullfils the latter two requirements is $$ {\bf P}={\bf I}-2{\bf u}{\bf u}^T, $$ where $ {\bf I} $ is the $ (n-1) $ unity matrix and $ {\bf u} $ is an $ n-1 $ column vector with norm $ {\bf u}^T{\bf u} $ (inner product).

Discussion of Householder's method for eigenvalues

Note that $ {\bf u}{\bf u}^T $ is an outer product giving a matrix of dimension ($ (n-1) \times (n-1) $). Each matrix element of $ {\bf P} $ then reads $$ P_{ij}=\delta_{ij}-2u_iu_j, $$ where $ i $ and $ j $ range from $ 1 $ to $ n-1 $. Applying the transformation $ {\bf S}_1 $ results in $$ {\bf S}_1^T{\bf A}{\bf S}_1 = \left( \begin{array}{cc} a_{11} & ({\bf Pv})^T \\ {\bf Pv}& {\bf A}' \end{array} \right) , $$ where $ {\bf v^{T}} = \{a_{21}, a_{31},\cdots, a_{n1}\} $ and {\bf P} must satisfy ($ {\bf Pv})^{T} = \{k, 0, 0,\cdots \} $. Then $$ \begin{equation} {\bf Pv} = {\bf v} -2{\bf u}( {\bf u}^T{\bf v})= k {\bf e}, \label{eq:palpha} \end{equation} $$ with $ {\bf e^{T}} = \{1,0,0,\dots 0\} $.

Discussion of Householder's method for eigenvalues

Solving the latter equation gives us $ {\bf u} $ and thus the needed transformation $ {\bf P} $. We do first however need to compute the scalar $ k $ by taking the scalar product of the last equation with its transpose and using the fact that $ {\bf P}^2={\bf I} $. We get then $$ ({\bf Pv})^T{\bf Pv} = k^{2} = {\bf v}^T{\bf v}= |v|^2 = \sum_{i=2}^{n}a_{i1}^2, $$ which determines the constant $ k = \pm v$.

Discussion of Householder's method for eigenvalues

Now we can rewrite Eq. \eqref{eq:palpha} as $$ {\bf v} - k{\bf e} = 2{\bf u}( {\bf u}^T{\bf v}), $$ and taking the scalar product of this equation with itself and obtain $$ \begin{equation} 2( {\bf u}^T{\bf v})^2=(v^2\pm a_{21}v), \label{eq:pmalpha} \end{equation} $$ which finally determines $$ {\bf u}=\frac{{\bf v}-k{\bf e}}{2( {\bf u}^T{\bf v})}. $$ In solving Eq. \eqref{eq:pmalpha} great care has to be exercised so as to choose those values which make the right-hand largest in order to avoid loss of numerical precision. The above steps are then repeated for every transformations till we have a tridiagonal matrix suitable for obtaining the eigenvalues.

Discussion of Householder's method for eigenvalues

Our Householder transformation has given us a tridiagonal matrix. We discuss here how one can use Householder's iterative procedure to obtain the eigenvalues. Let us specialize to a $ 4\times 4 $ matrix. The tridiagonal matrix takes the form $$ {\bf A} = \left( \begin{array}{cccc} d_{1} & e_{1} & 0 & 0 \\ e_{1} & d_{2} & e_{2} & 0 \\ 0 & e_{2} & d_{3} & e_{3} \\ 0 & 0 & e_{3} & d_{4} \end{array} \right). $$ As a first observation, if any of the elements $ e_{i} $ are zero the matrix can be separated into smaller pieces before diagonalization. Specifically, if $ e_{1} = 0 $ then $ d_{1} $ is an eigenvalue.

Discussion of Householder's method for eigenvalues

Then the similarity transformation $$ {\bf S_{1}^{T} A S_{1}} = {\bf A'} = \left( \begin{array}{cccc} d'_{1} & e'_{1} & 0 & 0 \\ e'_{1} & d_{2} & e_{2} & 0 \\ 0 & e_{2} & d_{3} & e'{3} \\ 0 & 0 & e'_{3} & d'_{4} \end{array} \right) $$ produces a matrix where the primed elements in $ {\bf A'} $ have been changed by the transformation whereas the unprimed elements are unchanged. If we now choose $ \theta $ to give the element $ a_{21}^{'} = e^{'}= 0 $ then we have the first eigenvalue $ = a_{11}^{'} = d_{1}^{'} $. (This is actually what you are doing in project 2!!)

Discussion of Householder's method for eigenvalues

This procedure can be continued on the remaining three-dimensional submatrix for the next eigenvalue. Thus after few transformations we have the wanted diagonal form.

What we see here is just a special case of the more general procedure developed by Francis in two articles in 1961 and 1962.

The algorithm is based on the so-called QR method (or just QR-algorithm). It follows from a theorem by Schur which states that any square matrix can be written out in terms of an orthogonal matrix $ {\bf Q} $ and an upper triangular matrix $ {\bf U} $. Historically R was used instead of U since the wording right triangular matrix was first used. The method is based on an iterative procedure similar to Jacobi's method, by a succession of planar rotations. For a tridiagonal matrix it is simple to carry out in principle, but complicated in detail! We will discuss this in more detail during week 38.

Week 38

Overview of week 38

Eigenvalue problems.

Monday: Brief repetition from last week, with discussion of project 2.
Mathematical aspects of Jacobi's algorithm
Discussion of Householder's algorithm
Tuesday: More on Householder's algorithm and diagonalization of tridiagonal matrices (Givens and Francis' algorithms)
Discussion of Lanczos' method (also optional part of project 2)
Lab: discussion of unit tests and how to write a scientific report

Reading assignment this week: chapters 7.1-7.4 of lecture notes

Eigenvalues with the QR and Lanczos methods

Our Householder transformation has given us a tridiagonal matrix. We discuss here how one can use Jacobi's iterative procedure to obtain the eigenvalues, although it may not be the best approach. Let us specialize to a $ 4\times 4 $ matrix. The tridiagonal matrix takes the form $$ {\bf A} = \left( \begin{array}{cccc} d_{1} & e_{1} & 0 & 0 \\ e_{1} & d_{2} & e_{2} & 0 \\ 0 & e_{2} & d_{3} & e_{3} \\ 0 & 0 & e_{3} & d_{4} \end{array} \right). $$ As a first observation, if any of the elements $ e_{i} $ are zero the matrix can be separated into smaller pieces before diagonalization. Specifically, if $ e_{1} = 0 $ then $ d_{1} $ is an eigenvalue.

Eigenvalues with the QR and Lanczos methods

Thus, let us introduce a transformation $ {\bf S_{1}} $ which operates like $$ {\bf S_{1}} = \left( \begin{array}{cccc} \cos \theta & 0 & 0 & \sin \theta\\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ \cos \theta & 0 & 0 & \cos \theta \end{array} \right) $$ Then the similarity transformation $$ {\bf S_{1}^{T} A S_{1}} = {\bf A'} = \left( \begin{array}{cccc} d'_{1} & e'_{1} & 0 & 0 \\ e'_{1} & d_{2} & e_{2} & 0 \\ 0 & e_{2} & d_{3} & e'{3} \\ 0 & 0 & e'_{3} & d'_{4} \end{array} \right) $$ produces a matrix where the primed elements in $ {\bf A'} $ have been changed by the transformation whereas the unprimed elements are unchanged. If we now choose $ \theta $ to give the element $ a_{21}^{'} = e^{'}= 0 $ then we have the first eigenvalue $ = a_{11}^{'} = d_{1}^{'} $.

This procedure can be continued on the remaining three-dimensional submatrix for the next eigenvalue. Thus after few transformations we have the wanted diagonal form.

What we see here is just a special case of the more general procedure developed by Francis in two articles in 1961 and 1962. Using Jacobi's method is not very efficient ether.

The algorithm is based on the so-called {\bf QR} method (or just {\bf QR}-algorithm). It follows from a theorem by Schur which states that any square matrix can be written out in terms of an orthogonal matrix $ \hat{Q} $ and an upper triangular matrix $ \hat{U} $. Historically $ R $ was used instead of $ U $ since the wording right triangular matrix was first used.

Eigenvalues with the QR and Lanczos methods

The method is based on an iterative procedure similar to Jacobi's method, by a succession of planar rotations. For a tridiagonal matrix it is simple to carry out in principle, but complicated in detail!

Schur's theorem $$ \hat{A} = \hat{Q}\hat{U}, $$ is used to rewrite any square matrix into a unitary matrix times an upper triangular matrix. We say that a square matrix is similar to a triangular matrix.

Householder's algorithm which we have derived is just a special case of the general Householder algorithm. For a symmetric square matrix we obtain a tridiagonal matrix.

There is a corollary to Schur's theorem which states that every Hermitian matrix is unitarily similar to a diagonal matrix.

Eigenvalues with the QR and Lanczos methods

It follows that we can define a new matrix $$ \hat{A}\hat{Q} = \hat{Q}\hat{U}\hat{Q}, $$ and multiply from the left with $ \hat{Q}^{-1} $ we get $$ \hat{Q}^{-1}\hat{A}\hat{Q} = \hat{B}=\hat{U}\hat{Q}, $$ where the matrix $ \hat{B} $ is a similarity transformation of $ \hat{A} $ and has the same eigenvalues as $ \hat{B} $.

Eigenvalues with the QR and Lanczos methods

Suppose $ \hat{A} $ is the triangular matrix we obtained after the Householder transformation, $$ \hat{A} = \hat{Q}\hat{U}, $$ and multiply from the left with $ \hat{Q}^{-1} $ resulting in $$ \hat{Q}^{-1}\hat{A} = \hat{U}. $$ Suppose that $ \hat{Q} $ consists of a series of planar Jacobi like rotations acting on sub blocks of $ \hat{A} $ so that all elements below the diagonal are zeroed out $$ \hat{Q}=\hat{R}_{12}\hat{R}_{23}\dots\hat{R}_{n-1,n}. $$

Eigenvalues with the QR and Lanczos methods

A transformation of the type $ \hat{R}_{12} $ looks like $$ \hat{R}_{12} = \left( \begin{array}{ccccccccc} c&s &0 &0 &0 & \dots &0 & 0 & 0\\ -s&c &0 &0 &0 & \dots &0 & 0 & 0\\ 0&0 &1 &0 &0 & \dots &0 & 0 & 0\\ \dots&\dots &\dots &\dots &\dots &\dots \\ 0&0 &0 & 0 & 0 & \dots &1 &0 &0 \\ 0&0 &0 & 0 & 0 & \dots &0 &1 &0 \\ 0&0 &0 & 0 & 0 & \dots &0 &0 & 1 \end{array} \right) $$

Eigenvalues with the QR and Lanczos methods

The matrix $ \hat{U} $ takes then the form $$ \hat{U} = \left( \begin{array}{ccccccccc} x&x &x &0 &0 & \dots &0 & 0 & 0\\ 0&x &x &x &0 & \dots &0 & 0 & 0\\ 0&0 &x &x &x & \dots &0 & 0 & 0\\ \dots&\dots &\dots &\dots &\dots &\dots \\ 0&0 &0 & 0 & 0 & \dots &x &x &x \\ 0&0 &0 & 0 & 0 & \dots &0 &x &x \\ 0&0 &0 & 0 & 0 & \dots &0 &0 & x \end{array} \right) $$ which has a second superdiagonal.

Eigenvalues with the QR and Lanczos methods

We have now found $ \hat{Q} $ and $ \hat{U} $ and this allows us to find the matrix $ \hat{B} $ which is, due to Schur's theorem, unitarily similar to a triangular matrix (upper in our case) since we have that $$ \hat{Q}^{-1}\hat{A}\hat{Q} = \hat{B}, $$ from Schur's theorem the matrix $ \hat{B} $ is triangular and the eigenvalues the same as those of $ \hat{A} $ and are given by the diagonal matrix elements of $ \hat{B} $. Why?

Our matrix $ \hat{B}=\hat{U}\hat{Q} $.

Eigenvalues with the QR and Lanczos methods

The matrix $ \hat{A} $ is transformed into a tridiagonal form and the last step is to transform it into a diagonal matrix giving the eigenvalues on the diagonal.

The eigenvalues of a matrix can be obtained using the characteristic polynomial $$ P(\lambda) = det(\lambda{\bf I}-{\bf A})= \prod_{i=1}^{n}\left(\lambda_i-\lambda\right), $$ which rewritten in matrix form reads $$ P(\lambda)= \left( \begin{array}{ccccccc} d_1-\lambda & e_1 & 0 & 0 & \dots &0 & 0 \\ e_1 & d_2-\lambda & e_2 & 0 & \dots &0 &0 \\ 0 & e_2 & d_3-\lambda & e_3 &0 &\dots & 0\\ \dots & \dots & \dots & \dots &\dots &\dots & \dots\\ 0 & \dots & \dots & \dots &\dots &d_{N_{\mathrm{step}}-2}-\lambda & e_{N_{\mathrm{step}}-1}\\ 0 & \dots & \dots & \dots &\dots &e_{N_{\mathrm{step}}-1} & d_{N_{\mathrm{step}}-1}-\lambda \end{array} \right) $$ We can solve this equation in an iterative manner. We let $ P_k(\lambda) $ be the value of $ k $ subdeterminant of the above matrix of dimension $ n\times n $. The polynomial $ P_k(\lambda) $ is clearly a polynomial of degree $ k $. Starting with $ P_1(\lambda) $ we have $ P_1(\lambda)=d_1-\lambda $. The next polynomial reads $ P_2(\lambda)=(d_2-\lambda)P_1(\lambda)-e_1^2 $. By expanding the determinant for $ P_k(\lambda) $ in terms of the minors of the $n$th column we arrive at the recursion relation \[ P_k(\lambda)=(d_k-\lambda)P_{k-1}(\lambda)-e_{k-1}^2P_{k-2}(\lambda). \] Together with the starting values $ P_1(\lambda) $ and $ P_2(\lambda) $ and good root searching methods we arrive at an efficient computational scheme for finding the roots of $ P_n(\lambda) $. However, for large matrices this algorithm is rather inefficient and time-consuming.

Eigenvalues with the QR and Lanczos methods

Basic features with a real symmetric matrix (and normally huge $ n> 10^6 $ and sparse) $ \hat{A} $ of dimension $ n\times n $:

Lanczos' algorithm generates a sequence of real tridiagonal matrices $ T_k $ of dimension $ k\times k $ with $ k\le n $, with the property that the extremal eigenvalues of $ T_k $ are progressively better estimates of $ \hat{A} $' extremal eigenvalues.* The method converges to the extremal eigenvalues.
The similarity transformation is

$$ \hat{T}= \hat{Q}^{T}\hat{A}\hat{Q}, $$ with the first vector $ \hat{Q}\hat{e}_1=\hat{q}_1 $.

We are going to solve iteratively $$ \hat{T}= \hat{Q}^{T}\hat{A}\hat{Q}, $$ with the first vector $ \hat{Q}\hat{e}_1=\hat{q}_1 $. We can write out the matrix $ \hat{Q} $ in terms of its column vectors $$ \hat{Q}=\left[\hat{q}_1\hat{q}_2\dots\hat{q}_n\right]. $$

Eigenvalues with the QR and Lanczos methods

The matrix $$ \hat{T}= \hat{Q}^{T}\hat{A}\hat{Q}, $$ can be written as $$ \hat{T} = \left(\begin{array}{cccccc} \alpha_1& \beta_1 & 0 &\dots & \dots &0 \\ \beta_1 & \alpha_2 & \beta_2 &0 &\dots &0 \\ 0& \beta_2 & \alpha_3 & \beta_3 & \dots &0 \\ \dots& \dots & \dots &\dots &\dots & 0 \\ \dots& & &\beta_{n-2} &\alpha_{n-1}& \beta_{n-1} \\ 0& \dots &\dots &0 &\beta_{n-1} & \alpha_{n} \\ \end{array} \right) $$

Eigenvalues with the QR and Lanczos methods

Using the fact that $ \hat{Q}\hat{Q}^T=\hat{I} $, we can rewrite $$ \hat{T}= \hat{Q}^{T}\hat{A}\hat{Q}, $$ as $$ \hat{Q}\hat{T}= \hat{A}\hat{Q}, $$ and if we equate columns $$ \hat{T} = \left(\begin{array}{cccccc} \alpha_1& \beta_1 & 0 &\dots & \dots &0 \\ \beta_1 & \alpha_2 & \beta_2 &0 &\dots &0 \\ 0& \beta_2 & \alpha_3 & \beta_3 & \dots &0 \\ \dots& \dots & \dots &\dots &\dots & 0 \\ \dots& & &\beta_{n-2} &\alpha_{n-1}& \beta_{n-1} \\ 0& \dots &\dots &0 &\beta_{n-1} & \alpha_{n} \\ \end{array} \right) $$ we obtain $$ \hat{A}\hat{q}_k=\beta_{k-1}\hat{q}_{k-1}+\alpha_k\hat{q}_k+\beta_k\hat{q}_{k+1}. $$

Eigenvalues with the QR and Lanczos methods

We have thus $$ \hat{A}\hat{q}_k=\beta_{k-1}\hat{q}_{k-1}+\alpha_k\hat{q}_k+\beta_k\hat{q}_{k+1}, $$ with $ \beta_0\hat{q}_0=0 $ for $ k=1:n-1 $. Remember that the vectors $ \hat{q}_k $ are orthornormal and this implies $$ \alpha_k=\hat{q}_k^T\hat{A}\hat{q}_k, $$ and these vectors are called Lanczos vectors.

Eigenvalues with the QR and Lanczos methods

We have thus $$ \hat{A}\hat{q}_k=\beta_{k-1}\hat{q}_{k-1}+\alpha_k\hat{q}_k+\beta_k\hat{q}_{k+1}, $$ with $ \beta_0\hat{q}_0=0 $ for $ k=1:n-1 $ and $$ \alpha_k=\hat{q}_k^T\hat{A}\hat{q}_k. $$ If $$ \hat{r}_k=(\hat{A}-\alpha_k\hat{I})\hat{q}_k-\beta_{k-1}\hat{q}_{k-1}, $$ is non-zero, then $$ \hat{q}_{k+1}=\hat{r}_{k}/\beta_k, $$ with $ \beta_k=\pm ||\hat{r}_{k}||_2 $.

The report: how to write a good scienfitic/technical report

What should it contain? A typical structure.

An introduction where you explain the aims and rationale for the physics case and what you have done. At the end of the introduction you should give a brief summary of the structure of the report
Theoretical models and technicalities. This is the methods section.
Results and discussion
Conclusions and perspectives
Appendix with extra material
Bibliography

Keep always a good log of what you do.

The report

What should I focus on? Introduction.

You don't need to answer all questions in a chronological order. When you write the introduction you could focus on the following aspects

Motivate the reader, the first part of the introduction gives always a motivation and tries to give the overarching ideas
What I have done
The structure of the report, how it is organized etc

The report

What should I focus on? Methods sections.

Describe the methods and algorithms
You need to explain how you implemented the methods and also say something about the structure of your algorithm and present some parts of your code
You should plug in some calculations to demonstrate your code, such as selected runs used to validate and verify your results. The latter is extremely important!! A reader needs to understand that your code reproduces selected benchmarks and reproduces previous results, either numerical and/or well-known closed form expressions.

The report

What should I focus on? Results.

Present your results
Give a critical discussion of your work and place it in the correct context.
Relate your work to other calculations/studies
An eventual reader should be able to reproduce your calculations if she/he wants to do so. All input variables should be properly explained.
Make sure that figures and tables should contain enough information in their captions, axis labels etc so that an eventual reader can gain a first impression of your work by studying figures and tables only.

The report

What should I focus on? Conclusions.

State your main findings and interpretations
Try as far as possible to present perspectives for future work
Try to discuss the pros and cons of the methods and possible improvements

The report

What should I focus on? additional material.

Additional calculations used to validate the codes
Selected calculations, these can be listed with few comments
Listing of the code if you feel this is necessary

You can consider moving parts of the material from the methods section to the appendix. You can also place additional material on your webpage.

The report

What should I focus on? References.

Give always references to material you base your work on, either scientific articles/reports or books.
Refer to articles as: name(s) of author(s), journal, volume (boldfaced), page and year in parenthesis.
Refer to books as: name(s) of author(s), title of book, publisher, place and year, eventual page numbers

Unit Testing

Unit Testing is the practice of testing the smallest testable parts, called units, of an application individually and independently to determine if they behave exactly as expected. Unit tests (short code fragments) are usually written such that they can be preformed at any time during the development to continually verify the behavior of the code. In this way, possible bugs will be identified early in the development cycle, making the debugging at later stage much easier. There are many benefits associated with Unit Testing, such as

It increases confidence in changing and maintaining code. Big changes can be made to the code quickly, since the tests will ensure that everything still is working properly.
Since the code needs to be modular to make Unit Testing possible, the code will be easier to reuse. This improves the code design.
Debugging is easier, since when a test fails, only the latest changes need to be debugged.

Different parts of a project can be tested without the need to wait for the other parts to be available.

A unit test can serve as a documentation on the functionality of a unit of the code.

Simple example of unit test

Look up the guide on how to install unit tests for c++ at course webpage. This is the version with classes.

#include <unittest++/UnitTest++.h>

class MyMultiplyClass{
public:
    double multiply(double x, double y) {
        return x * y;
    }
};

TEST(MyMath) {
    MyMultiplyClass my;
    CHECK_EQUAL(56, my.multiply(7,8));
}

int main()
{
    return UnitTest::RunAllTests();
}

Simple example of unit test

And without classes

#include <unittest++/UnitTest++.h>


double multiply(double x, double y) {
    return x * y;
}

TEST(MyMath) {
    CHECK_EQUAL(56, multiply(7,8));
}

int main()
{
    return UnitTest::RunAllTests();
}

For Fortran users, the link at http://sourceforge.net/projects/fortranxunit/ contains a similar software for unit testing.

Iterative methods, Chapter 6

Direct solvers such as Gauss elimination and LU decomposition discussed in connection with project 1.
Iterative solvers such as Basic iterative solvers, Jacobi, Gauss-Seidel, Successive over-relaxation. These methods are easy to parallelize, as we will se later. Much used in solutions of partial differential equations.
Other iterative methods such as Krylov subspace methods with Generalized minimum residual (GMRES) and Conjugate gradient etc will not be discussed.

Iterative methods, Jacobi's method

It is a simple method for solving $$ \hat{A}{\bf x}={\bf b}, $$ where $ \hat{A} $ is a matrix and $ {\bf x} $ and $ {\bf b} $ are vectors. The vector $ {\bf x} $ is the unknown.

It is an iterative scheme where we start with a guess for the unknown, and after $ k+1 $ iterations we have $$ {\bf x}^{(k+1)}= \hat{D}^{-1}({\bf b}-(\hat{L}+\hat{U}){\bf x}^{(k)}), $$ with $ \hat{A}=\hat{D}+\hat{U}+\hat{L} $ and $ \hat{D} $ being a diagonal matrix, $ \hat{U} $ an upper triangular matrix and $ \hat{L} $ a lower triangular matrix.

If the matrix $ \hat{A} $ is positive definite or diagonally dominant, one can show that this method will always converge to the exact solution.

Iterative methods, Jacobi's method

We can demonstrate Jacobi's method by this $ 4\times 4 $ matrix problem. We assume a guess for the vector elements $ x_i^{(0)} $, a guess which represents our first iteration. The new values are obtained by substitution $$ \begin{eqnarray} x_1^{(1)} =&(b_1-a_{12}x_2^{(0)} -a_{13}x_3^{(0)} - a_{14}x_4^{(0)})/a_{11} \nonumber \\ x_2^{(1)} =&(b_2-a_{21}x_1^{(0)} - a_{23}x_3^{(0)} - a_{24}x_4^{(0)})/a_{22} \nonumber \\ x_3^{(1)} =&(b_3- a_{31}x_1^{(0)} -a_{32}x_2^{(0)} -a_{34}x_4^{(0)})/a_{33} \nonumber \\ x_4^{(1)}=&(b_4-a_{41}x_1^{(0)} -a_{42}x_2^{(0)} - a_{43}x_3^{(0)})/a_{44}, \nonumber \end{eqnarray} $$ which after $ k+1 $ iterations reads $$ \begin{eqnarray} x_1^{(k+1)} =&(b_1-a_{12}x_2^{(k)} -a_{13}x_3^{(k)} - a_{14}x_4^{(k)})/a_{11} \nonumber \\ x_2^{(k+1)} =&(b_2-a_{21}x_1^{(k)} - a_{23}x_3^{(k)} - a_{24}x_4^{(k)})/a_{22} \nonumber \\ x_3^{(k+1)} =&(b_3- a_{31}x_1^{(k)} -a_{32}x_2^{(k)} -a_{34}x_4^{(k)})/a_{33} \nonumber \\ x_4^{(k+1)}=&(b_4-a_{41}x_1^{(k)} -a_{42}x_2^{(k)} - a_{43}x_3^{(k)})/a_{44}, \nonumber \end{eqnarray} $$

Iterative methods, Jacobi's method

We can generalize the above equations to $$ x_i^{(k+1)}=(b_i-\sum_{j=1, j\ne i}^{n}a_{ij}x_j^{(k)})/a_{ii} $$ or in an even more compact form as $$ {\bf x}^{(k+1)}= \hat{D}^{-1}({\bf b}-(\hat{L}+\hat{U}){\bf x}^{(k)}), $$ with $ \hat{A}=\hat{D}+\hat{U}+\hat{L} $ and $ \hat{D} $ being a diagonal matrix, $ \hat{U} $ an upper triangular matrix and $ \hat{L} $ a lower triangular matrix.

Iterative methods, Gauss-Seidel's method

Our $ 4\times 4 $ matrix problem $$ \begin{eqnarray} x_1^{(k+1)} =&(b_1-a_{12}x_2^{(k)} -a_{13}x_3^{(k)} - a_{14}x_4^{(k)})/a_{11} \nonumber \\ x_2^{(k+1)} =&(b_2-a_{21}x_1^{(k)} - a_{23}x_3^{(k)} - a_{24}x_4^{(k)})/a_{22} \nonumber \\ x_3^{(k+1)} =&(b_3- a_{31}x_1^{(k)} -a_{32}x_2^{(k)} -a_{34}x_4^{(k)})/a_{33} \nonumber \\ x_4^{(k+1)}=&(b_4-a_{41}x_1^{(k)} -a_{42}x_2^{(k)} - a_{43}x_3^{(k)})/a_{44}, \nonumber \end{eqnarray} $$ can be rewritten as $$ \begin{eqnarray} x_1^{(k+1)} =&(b_1-a_{12}x_2^{(k)} -a_{13}x_3^{(k)} - a_{14}x_4^{(k)})/a_{11} \nonumber \\ x_2^{(k+1)} =&(b_2-a_{21}x_1^{(k+1)} - a_{23}x_3^{(k)} - a_{24}x_4^{(k)})/a_{22} \nonumber \\ x_3^{(k+1)} =&(b_3- a_{31}x_1^{(k+1)} -a_{32}x_2^{(k+1)} -a_{34}x_4^{(k)})/a_{33} \nonumber \\ x_4^{(k+1)}=&(b_4-a_{41}x_1^{(k+1)} -a_{42}x_2^{(k+1)} - a_{43}x_3^{(k+1)})/a_{44}, \nonumber \end{eqnarray} $$ which allows us to utilize the preceding solution (forward substitution). This improves normally the convergence behavior and leads to the Gauss-Seidel method!

Iterative methods, Gauss-Seidel's method

We can generalize $$ \begin{eqnarray} x_1^{(k+1)} =&(b_1-a_{12}x_2^{(k)} -a_{13}x_3^{(k)} - a_{14}x_4^{(k)})/a_{11} \nonumber \\ x_2^{(k+1)} =&(b_2-a_{21}x_1^{(k+1)} - a_{23}x_3^{(k)} - a_{24}x_4^{(k)})/a_{22} \nonumber \\ x_3^{(k+1)} =&(b_3- a_{31}x_1^{(k+1)} -a_{32}x_2^{(k+1)} -a_{34}x_4^{(k)})/a_{33} \nonumber \\ x_4^{(k+1)}=&(b_4-a_{41}x_1^{(k+1)} -a_{42}x_2^{(k+1)} - a_{43}x_3^{(k+1)})/a_{44}, \nonumber \end{eqnarray} $$ to the following form $$ x^{(k+1)}_i = \frac{1}{a_{ii}} \left(b_i - \sum_{j > i}a_{ij}x^{(k)}_j - \sum_{j < i}a_{ij}x^{(k+1)}_j \right),\quad i=1,2,\ldots,n. $$ The procedure is generally continued until the changes made by an iteration are below some tolerance.

The convergence properties of the Jacobi method and the Gauss-Seidel method are dependent on the matrix $ \hat{A} $. These methods converge when the matrix is symmetric positive-definite, or is strictly or irreducibly diagonally dominant. Both methods sometimes converge even if these conditions are not satisfied.

Iterative methods, Successive over-relaxation

Given a square system of n linear equations with unknown $ \mathbf x $: $$ \hat{A}\mathbf x = \mathbf b $$ where $$ \hat{A}=\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}, \qquad \mathbf{x} = \begin{bmatrix} x_{1} \\ x_2 \\ \vdots \\ x_n \end{bmatrix} , \qquad \mathbf{b} = \begin{bmatrix} b_{1} \\ b_2 \\ \vdots \\ b_n \end{bmatrix}. $$

Iterative methods, Successive over-relaxation

Then A can be decomposed into a diagonal component D, and strictly lower and upper triangular components L and U: $$ \hat{A} =\hat{D} + \hat{L} + \hat{U}, $$ where $$ D = \begin{bmatrix} a_{11} & 0 & \cdots & 0 \\ 0 & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\0 & 0 & \cdots & a_{nn} \end{bmatrix}, \quad L = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ a_{21} & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\a_{n1} & a_{n2} & \cdots & 0 \end{bmatrix}, \quad U = \begin{bmatrix} 0 & a_{12} & \cdots & a_{1n} \\ 0 & 0 & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\0 & 0 & \cdots & 0 \end{bmatrix}. $$ The system of linear equations may be rewritten as: $$ (D+\omega L) \mathbf{x} = \omega \mathbf{b} - [\omega U + (\omega-1) D ] \mathbf{x} $$ for a constant $ \omega > 1 $.

Iterative methods, Successive over-relaxation

The method of successive over-relaxation is an iterative technique that solves the left hand side of this expression for $ x $, using previous value for $ x $ on the right hand side. Analytically, this may be written as: $$ \mathbf{x}^{(k+1)} = (D+\omega L)^{-1} \big(\omega \mathbf{b} - [\omega U + (\omega-1) D ] \mathbf{x}^{(k)}\big). $$ However, by taking advantage of the triangular form of $ (D+\omega L) $, the elements of $ x^{(k+1)} $ can be computed sequentially using forward substitution: $$ x^{(k+1)}_i = (1-\omega)x^{(k)}_i + \frac{\omega}{a_{ii}} \left(b_i - \sum_{j > i} a_{ij}x^{(k)}_j - \sum_{j < i} a_{ij}x^{(k+1)}_j \right),\quad i=1,2,\ldots,n. $$ The choice of relaxation factor is not necessarily easy, and depends upon the properties of the coefficient matrix. For symmetric, positive-definite matrices it can be proven that $ 0 < \omega < 2 $ will lead to convergence, but we are generally interested in faster convergence rather than just convergence.

Cubic Splines, Chapter 6

Cubic spline interpolation is among one of the mostly used methods for interpolating between data points where the arguments are organized as ascending series. In the library program we supply such a function, based on the so-called cubic spline method to be described below.

A spline function consists of polynomial pieces defined on subintervals. The different subintervals are connected via various continuity relations.

Assume we have at our disposal $ n+1 $ points $ x_0, x_1, \dots x_n $ arranged so that $ x_0 < x_1 < x_2 < \dots x_{n-1} < x_n $ (such points are called knots). A spline function $ s $ of degree $ k $ with $ n+1 $ knots is defined as follows

On every subinterval $ [x_{i-1},x_i) $ s is a polynomial of degree $ \le k $.
$ s $ has $ k-1 $ continuous derivatives in the whole interval $ [x_0,x_n] $.

Splines

As an example, consider a spline function of degree $ k=1 $ defined as follows $$ s(x)=\left\{\begin{array}{cc} s_0(x)=a_0x+b_0 & x\in [x_0, x_1) \\ s_1(x)=a_1x+b_1 & x\in [x_1, x_2) \\ \dots & \dots \\ s_{n-1}(x)=a_{n-1}x+b_{n-1} & x\in [x_{n-1}, x_n] \end{array}\right. $$ In this case the polynomial consists of series of straight lines connected to each other at every endpoint. The number of continuous derivatives is then $ k-1=0 $, as expected when we deal with straight lines. Such a polynomial is quite easy to construct given $ n+1 $ points $ x_0, x_1, \dots x_n $ and their corresponding function values.

Splines

The most commonly used spline function is the one with $ k=3 $, the so-called cubic spline function. Assume that we have in adddition to the $ n+1 $ knots a series of functions values $ y_0=f(x_0), y_1=f(x_1), \dots y_n=f(x_n) $. By definition, the polynomials $ s_{i-1} $ and $ s_i $ are thence supposed to interpolate the same point $ i $, that is $$ s_{i-1}(x_i)= y_i = s_i(x_i), $$ with $ 1 \le i \le n-1 $. In total we have $ n $ polynomials of the type $$ s_i(x)=a_{i0}+a_{i1}x+a_{i2}x^2+a_{i2}x^3, $$ yielding $ 4n $ coefficients to determine.

Splines

Every subinterval provides in addition the $ 2n $ conditions $$ y_i = s(x_i), $$ and $$ s(x_{i+1})= y_{i+1}, $$ to be fulfilled. If we also assume that $ s' $ and $ s'' $ are continuous, then $$ s'_{i-1}(x_i)= s'_i(x_i), $$ yields $ n-1 $ conditions. Similarly, $$ s''_{i-1}(x_i)= s''_i(x_i), $$ results in additional $ n-1 $ conditions. In total we have $ 4n $ coefficients and $ 4n-2 $ equations to determine them, leaving us with $ 2 $ degrees of freedom to be determined.

Splines

Using the last equation we define two values for the second derivative, namely $$ s''_{i}(x_i)= f_i, $$ and $$ s''_{i}(x_{i+1})= f_{i+1}, $$ and setting up a straight line between $ f_i $ and $ f_{i+1} $ we have $$ s_i''(x) = \frac{f_i}{x_{i+1}-x_i}(x_{i+1}-x)+ \frac{f_{i+1}}{x_{i+1}-x_i}(x-x_i), $$ and integrating twice one obtains $$ s_i(x) = \frac{f_i}{6(x_{i+1}-x_i)}(x_{i+1}-x)^3+ \frac{f_{i+1}}{6(x_{i+1}-x_i)}(x-x_i)^3 +c(x-x_i)+d(x_{i+1}-x). $$

Splines

Using the conditions $ s_i(x_i)=y_i $ and $ s_i(x_{i+1})=y_{i+1} $ we can in turn determine the constants $ c $ and $ d $ resulting in $$ \begin{eqnarray} s_i(x) =&\frac{f_i}{6(x_{i+1}-x_i)}(x_{i+1}-x)^3+ \frac{f_{i+1}}{6(x_{i+1}-x_i)}(x-x_i)^3 \nonumber \\ +&(\frac{y_{i+1}}{x_{i+1}-x_i}-\frac{f_{i+1}(x_{i+1}-x_i)}{6}) (x-x_i)+ (\frac{y_{i}}{x_{i+1}-x_i}-\frac{f_{i}(x_{i+1}-x_i)}{6}) (x_{i+1}-x). \end{eqnarray} $$

Splines

How to determine the values of the second derivatives $ f_{i} $ and $ f_{i+1} $? We use the continuity assumption of the first derivatives $$ s'_{i-1}(x_i)= s'_i(x_i), $$ and set $ x=x_i $. Defining $ h_i=x_{i+1}-x_i $ we obtain finally the following expression $$ h_{i-1}f_{i-1}+2(h_{i}+h_{i-1})f_i+h_if_{i+1}= \frac{6}{h_i}(y_{i+1}-y_i)-\frac{6}{h_{i-1}}(y_{i}-y_{i-1}), $$ and introducing the shorthands $ u_i=2(h_{i}+h_{i-1}) $, $ v_i=\frac{6}{h_i}(y_{i+1}-y_i)-\frac{6}{h_{i-1}}(y_{i}-y_{i-1}) $, we can reformulate the problem as a set of linear equations to be solved through e.g., Gaussian elemination

Splines

Gaussian elimination $$ \left[\begin{array}{cccccccc} u_1 & h_1 &0 &\dots & & & & \\ h_1 & u_2 & h_2 &0 &\dots & & & \\ 0 & h_2 & u_3 & h_3 &0 &\dots & & \\ \dots& & \dots &\dots &\dots &\dots &\dots & \\ &\dots & & &0 &h_{n-3} &u_{n-2} &h_{n-2} \\ & && & &0 &h_{n-2} &u_{n-1} \end{array}\right] \left[\begin{array}{c} f_1 \\ f_2 \\ f_3\\ \dots \\ f_{n-2} \\ f_{n-1} \end{array} \right] = \left[\begin{array}{c} v_1 \\ v_2 \\ v_3\\ \dots \\ v_{n-2}\\ v_{n-1} \end{array} \right]. $$ Note that this is a set of tridiagonal equations and can be solved through only $ O(n) $ operations.

Splines

The functions supplied in the program library are spline and splint. In order to use cubic spline interpolation you need first to call

spline(double x[], double y[], int n, double yp1,  double yp2, double y2[])

This function takes as input $ x[0,..,n - 1] $ and $ y[0,..,n - 1] $ containing a tabulation $ y_i = f(x_i) $ with $ x_0 < x_1 < .. < x_{n - 1} $ together with the first derivatives of $ f(x) $ at $ x_0 $ and $ x_{n-1} $, respectively. Then the function returns $ y2[0,..,n-1] $ which contains the second derivatives of $ f(x_i) $ at each point $ x_i $. $ n $ is the number of points. This function provides the cubic spline interpolation for all subintervals and is called only once.

Splines

Thereafter, if you wish to make various interpolations, you need to call the function

splint(double x[], double y[], double y2a[], int n, double x, double *y)

which takes as input the tabulated values $ x[0,..,n - 1] $ and $ y[0,..,n - 1] $ and the output y2a[0,..,n - 1] from spline. It returns the value $ y $ corresponding to the point $ x $.

Overview of Week 39

Differential equations program, week 39

Ordinary differential equations, Runge-Kutta method,chapter 8
Ordinary differential equations with boundary conditions: one-variable equations to be solved by shooting and Green's function methods, chapter 9
We can solve such equations by a finite difference scheme as well, turning the equation into an eigenvalue problem. Still one variable. Done in projects 1 and 2.
If we have more than one variable, we need to solve partial differential equations, see Chapter 10

The material on differential equations is covered by chapters 8, 9 and 10. Project 3 deals with ordinary differential equations (the solar system). Reading assignment for week 39 is chapter 8 of lecture notes.

Differential Equations, chapter 8

The order of the ODE refers to the order of the derivative on the left-hand side in the equation $$ \begin{equation} \frac{dy}{dt}=f(t,y). \end{equation} $$ This equation is of first order and $ f $ is an arbitrary function. A second-order equation goes typically like $$ \begin{equation} \frac{d^2y}{dt^2}=f(t,\frac{dy}{dt},y). \end{equation} $$ A well-known second-order equation is Newton's second law $$ \begin{equation} m\frac{d^2x}{dt^2}=-kx, \label{eq:newton} \end{equation} $$ where $ k $ is the force constant. ODE depend only on one variable

Differential Equations

partial differential equations like the time-dependent Schr\"odinger equation $$ \begin{equation} i\hbar\frac{\partial \psi({\bf x},t)}{\partial t}= -\frac{\hbar^2}{2m}\left( \frac{\partial^2 \psi({\bf r},t)}{\partial x^2} + \frac{\partial^2 \psi({\bf r},t)}{\partial y^2}+ \frac{\partial^2 \psi({\bf r},t)}{\partial z^2}\right) + V({\bf x})\psi({\bf x},t), \end{equation} $$ may depend on several variables. In certain cases, like the above equation, the wave function can be factorized in functions of the separate variables, so that the Schroedinger equation can be rewritten in terms of sets of ordinary differential equations. These equations are discussed in chapter 10. Involve boundary conditions in addition to initial conditions.

Differential Equations

We distinguish also between linear and non-linear differential equation where for example $$ \begin{equation} \frac{dy}{dt}=g^3(t)y(t), \end{equation} $$ is an example of a linear equation, while $$ \begin{equation} \frac{dy}{dt}=g^3(t)y(t)-g(t)y^2(t), \end{equation} $$ is a non-linear ODE.

Differential Equations

Another concept which dictates the numerical method chosen for solving an ODE, is that of initial and boundary conditions. To give an example, if we study white dwarf stars or neutron stars we will need to solve two coupled first-order differential equations, one for the total mass $ m $ and one for the pressure $ P $ as functions of $ \rho $ $$ \frac{dm}{dr}=4\pi r^{2}\rho (r)/c^2, $$ and $$ \frac{dP}{dr}=-\frac{Gm(r)}{r^{2}}\rho (r)/c^2. $$ where $ \rho $ is the mass-energy density. The initial conditions are dictated by the mass being zero at the center of the star, i.e., when $ r=0 $, yielding $ m(r=0)=0 $. The other condition is that the pressure vanishes at the surface of the star.

In the solution of the Schroedinger equation for a particle in a potential, we may need to apply boundary conditions as well, such as demanding continuity of the wave function and its derivative.

Differential Equations

In many cases it is possible to rewrite a second-order differential equation in terms of two first-order differential equations. Consider again the case of Newton's second law in Eq. \eqref{eq:newton}. If we define the position $ x(t)=y^{(1)}(t) $ and the velocity $ v(t)=y^{(2)}(t) $ as its derivative $$ \begin{equation} \frac{dy^{(1)}(t)}{dt}=\frac{dx(t)}{dt}=y^{(2)}(t), \end{equation} $$ we can rewrite Newton's second law as two coupled first-order differential equations $$ \begin{equation} m\frac{dy^{(2)}(t)}{dt}=-kx(t)=-ky^{(1)}(t), \label{eq:n1} \end{equation} $$ and $$ \begin{equation} \frac{dy^{(1)}(t)}{dt}=y^{(2)}(t). \label{eq:n2} \end{equation} $$

Differential Equations, Finite Difference

These methods fall under the general class of one-step methods. The algoritm is rather simple. Suppose we have an initial value for the function $ y(t) $ given by $$ \begin{equation} y_0=y(t=t_0). \end{equation} $$ We are interested in solving a differential equation in a region in space $ [a,b] $. We define a step $ h $ by splitting the interval in $ N $ sub intervals, so that we have $$ \begin{equation} h=\frac{b-a}{N}. \end{equation} $$ With this step and the derivative of $ y $ we can construct the next value of the function $ y $ at $$ \begin{equation} y_1=y(t_1=t_0+h), \end{equation} $$ and so forth.

Differential Equations

If the function is rather well-behaved in the domain $ [a,b] $, we can use a fixed step size. If not, adaptive steps may be needed. Here we concentrate on fixed-step methods only. Let us try to generalize the above procedure by writing the step $ y_{i+1} $ in terms of the previous step $ y_i $ $$ \begin{equation} y_{i+1}=y(t=t_i+h)=y(t_i) + h\Delta(t_i,y_i(t_i)) + O(h^{p+1}), \end{equation} $$ where $ O(h^{p+1}) $ represents the truncation error. To determine $ \Delta $, we Taylor expand our function $ y $ $$ \begin{equation} y_{i+1}=y(t=t_i+h)=y(t_i) + h(y'(t_i)+\dots +y^{(p)}(t_i)\frac{h^{p-1}}{p!}) + O(h^{p+1}), \label{eq:taylor} \end{equation} $$ where we will associate the derivatives in the parenthesis with $$ \begin{equation} \Delta(t_i,y_i(t_i))=(y'(t_i)+\dots +y^{(p)}(t_i)\frac{h^{p-1}}{p!}). \label{eq:delta} \end{equation} $$

Differential Equations

We define $$ \begin{equation} y'(t_i)=f(t_i,y_i) \end{equation} $$ and if we truncate $ \Delta $ at the first derivative, we have $$ \begin{equation} y_{i+1}=y(t_i) + hf(t_i,y_i) + O(h^2), \label{eq:euler} \end{equation} $$ which when complemented with $ t_{i+1}=t_i+h $ forms the algorithm for the well-known Euler method. Note that at every step we make an approximation error of the order of $ O(h^2) $, however the total error is the sum over all steps $ N=(b-a)/h $, yielding thus a global error which goes like $ NO(h^2)\approx O(h) $.

Differential Equations

To make Euler's method more precise we can obviously decrease $ h $ (increase $ N $). However, if we are computing the derivative $ f $ numerically by for example the two-steps formula $$ f'_{2c}(x)= \frac{f(x+h)-f(x)}{h}+O(h), $$ we can enter into roundoff error problems when we subtract two almost equal numbers $ f(x+h)-f(x)\approx 0 $. Euler's method is not recommended for precision calculation, although it is handy to use in order to get a first view on how a solution may look like. As an example, consider Newton's equation rewritten in Eqs. \eqref{eq:n1} and \eqref{eq:n2}. We define $ y_0=y^{(1)}(t=0) $ an $ v_0=y^{(2)}(t=0) $. The first steps in Newton's equations are then $$ \begin{equation} y^{(1)}_1=y_0+hv_0+O(h^2) \end{equation} $$ and $$ \begin{equation} y^{(2)}_1=v_0-hy_0k/m+O(h^2). \end{equation} $$

Differential Equations

The Euler method is asymmetric in time, since it uses information about the derivative at the beginning of the time interval. This means that we evaluate the position at $ y^{(1)}_1 $ using the velocity at $ y^{(2)}_0=v_0 $. A simple variation is to determine $ y^{(1)}_{n+1} $ using the velocity at $ y^{(2)}_{n+1} $, that is (in a slightly more generalized form) $$ \begin{equation} y^{(1)}_{n+1}=y^{(1)}_{n}+h y^{(2)}_{n+1}+O(h^2) \end{equation} $$ and $$ \begin{equation} y^{(2)}_{n+1}=y^{(2)}_{n}+h a_{n}+O(h^2). \end{equation} $$ The acceleration $ a_n $ is a function of $ a_n(y^{(1)}_{n}, y^{(2)}_{n},t) $ and needs to be evaluated as well. This is the Euler-Cromer method.

Differential Equations

Let us then include the second derivative in our Taylor expansion. We have then $$ \begin{equation} \Delta(t_i,y_i(t_i))=f(t_i)+\frac{h}{2}\frac{df(t_i,y_i)}{dt}+O(h^3). \end{equation} $$ The second derivative can be rewritten as $$ \begin{equation} y''=f'=\frac{df}{dt}=\frac{\partial f}{\partial t}+\frac{\partial f}{\partial y}\frac{\partial y}{\partial t}=\frac{\partial f}{\partial t}+\frac{\partial f}{\partial y}f \label{eq:derivatives} \end{equation} $$ and we can rewrite Eq.\ \eqref{eq:taylor} as $$ \begin{equation} y_{i+1}=y(t=t_i+h)=y(t_i) +hf(t_i)+ \frac{h^2}{2}\left(\frac{\partial f}{\partial t}+\frac{\partial f}{\partial y}f\right) + O(h^{3 }), \end{equation} $$ which has a local approximation error $ O(h^{3 }) $ and a global error $ O(h^{2}) $.

Differential Equations

These approximations can be generalized by using the derivative $ f $ to arbitrary order so that we have $$ \begin{equation} y_{i+1}=y(t=t_i+h)=y(t_i) + h(f(t_i,y_i)+\dots f^{(p-1)}(t_i,y_i) \frac{h^{p-1}}{p!}) + O(h^{p+1}). \end{equation} $$ These methods, based on higher-order derivatives, are in general not used in numerical computation, since they rely on evaluating derivatives several times. Unless one has analytical expressions for these, the risk of roundoff errors is large.

Differential Equations

The most obvious improvements to Euler's and Euler-Cromer's algorithms, avoiding in addition the need for computing a second derivative, is the so-called midpoint method. We have then $$ \begin{equation} y^{(1)}_{n+1}=y^{(1)}_{n}+\frac{h}{2}\left(y^{(2)}_{n+1}+y^{(2)}_{n}\right)+O(h^2) \end{equation} $$ and $$ \begin{equation} y^{(2)}_{n+1}=y^{(2)}_{n}+h a_{n}+O(h^2), \end{equation} $$ yielding $$ \begin{equation} y^{(1)}_{n+1}=y^{(1)}_{n}+hy^{(2)}_{n}+\frac{h^2}{2}a_n+O(h^3) \end{equation} $$ implying that the local truncation error in the position is now $ O(h^3) $, whereas Euler's or Euler-Cromer's methods have a local error of $ O(h^2) $.

Differential Equations

Thus, the midpoint method yields a global error with second-order accuracy for the position and first-order accuracy for the velocity. However, although these methods yield exact results for constant accelerations, the error increases in general with each time step.

One method that avoids this is the so-called half-step method. Here we define $$ \begin{equation} y^{(2)}_{n+1/2}=y^{(2)}_{n-1/2}+h a_{n}+O(h^2), \end{equation} $$ and $$ \begin{equation} y^{(1)}_{n+1}=y^{(1)}_{n}+hy^{(2)}_{n+1/2} +O(h^2). \end{equation} $$ Note that this method needs the calculation of $ y^{(2)}_{1/2} $. This is done using e.g., Euler's method $$ \begin{equation} y^{(2)}_{1/2}=y^{(2)}_{0}+h a_{0}+O(h^2). \end{equation} $$ As this method is numerically stable, it is often used instead of Euler's method.

Differential Equations

Another method which one may encounter is the Euler-Richardson method with $$ \begin{equation} y^{(2)}_{n+1}=y^{(2)}_{n}+h a_{n+1/2}+O(h^2), \label{eq:er1} \end{equation} $$ and $$ \begin{equation} \label{eq:er2} y^{(1)}_{n+1}=y^{(1)}_{n}+hy^{(2)}_{n+1/2} +O(h^2). \end{equation} $$ The program program2.cpp includes all of the above methods.

Differential Equations, Runge-Kutta methods

Runge-Kutta (RK) methods are based on Taylor expansion formulae, but yield in general better algorithms for solutions of an ODE. The basic philosophy is that it provides an intermediate step in the computation of $ y_{i+1} $.

To see this, consider first the following definitions $$ \begin{equation} \frac{dy}{dt}=f(t,y), \end{equation} $$ and $$ \begin{equation} y(t)=\int f(t,y) dt, \end{equation} $$ and $$ \begin{equation} y_{i+1}=y_i+ \int_{t_i}^{t_{i+1}} f(t,y) dt. \end{equation} $$

Differential Equations, Runge-Kutta methods

To demonstrate the philosophy behind RK methods, let us consider the second-order RK method, RK2. The first approximation consists in Taylor expanding $ f(t,y) $ around the center of the integration interval $ t_i $ to $ t_{i+1} $, that is, at $ t_i+h/2 $, $ h $ being the step. Using the midpoint formula for an integral, defining $ y(t_i+h/2) = y_{i+1/2} $ and $ t_i+h/2 = t_{i+1/2} $, we obtain $$ \begin{equation} \int_{t_i}^{t_{i+1}} f(t,y) dt \approx hf(t_{i+1/2},y_{i+1/2}) +O(h^3). \end{equation} $$ This means in turn that we have $$ \begin{equation} y_{i+1}=y_i + hf(t_{i+1/2},y_{i+1/2}) +O(h^3). \end{equation} $$

Differential Equations, Runge-Kutta methods

However, we do not know the value of $ y_{i+1/2} $. Here comes thus the next approximation, namely, we use Euler's method to approximate $ y_{i+1/2} $. We have then $$ \begin{equation} y_{(i+1/2)}=y_i + \frac{h}{2}\frac{dy}{dt} = y(t_i) + \frac{h}{2}f(t_i,y_i). \end{equation} $$ This means that we can define the following algorithm for the second-order Runge-Kutta method, RK2. $$ \begin{equation} k_1=hf(t_i,y_i), \end{equation} $$ $$ \begin{equation} k_2=hf(t_{i+1/2},y_i+k_1/2), \end{equation} $$ with the final value $$ \begin{equation} y_{i+i}\approx y_i + k_2 +O(h^3). \end{equation} $$

The difference between the previous one-step methods is that we now need an intermediate step in our evaluation, namely $ t_i+h/2 = t_{(i+1/2)} $ where we evaluate the derivative $ f $. This involves more operations, but the gain is a better stability in the solution.

Differential Equations, Runge-Kutta methods

The fourth-order Runge-Kutta, RK4, which we will employ in the solution of various differential equations below, has the following algorithm $$ \begin{equation} k_1=hf(t_i,y_i), \end{equation} $$ $$ \begin{equation} k_2=hf(t_i+h/2,y_i+k_1/2), \end{equation} $$ $$ \begin{equation} k_3=hf(t_i+h/2,y_i+k_2/2) \end{equation} $$ $$ \begin{equation} k_4=hf(t_i+h,y_i+k_3) \end{equation} $$ with the final value $$ \begin{equation} y_{i+1}=y_i +\frac{1}{6}\left( k_1 +2k_2+2k_3+k_4\right). \end{equation} $$ Thus, the algorithm consists in first calculating $ k_1 $ with $ t_i $, $ y_1 $ and $ f $ as inputs. Thereafter, we increase the step size by $ h/2 $ and calculate $ k_2 $, then $ k_3 $ and finally $ k_4 $. Global error as $ O(h^4) $.

Simple Example, Block tied to a Wall

Our first example is the classical case of simple harmonic oscillations, namely a block sliding on a horizontal frictionless surface. The block is tied to a wall with a spring. If the spring is not compressed or stretched too far, the force on the block at a given position $ x $ is $$ F=-kx. $$ The negative sign means that the force acts to restore the object to an equilibrium position. Newton's equation of motion for this idealized system is then $$ m\frac{d^2x}{dt^2}=-kx, $$ or we could rephrase it as $$ \frac{d^2x}{dt^2}=-\frac{k}{m}x=-\omega_0^2x, \label{eq:newton1} $$ with the angular frequency $ \omega_0^2=k/m $.

The above differential equation has the advantage that it can be solved analytically with solutions on the form $$ x(t)=Acos(\omega_0t+\nu), $$ where $ A $ is the amplitude and $ \nu $ the phase constant. This provides in turn an important test for the numerical solution and the development of a program for more complicated cases which cannot be solved analytically.

Simple Example, Block tied to a Wall

With the position $ x(t) $ and the velocity $ v(t)=dx/dt $ we can reformulate Newton's equation in the following way $$ \frac{dx(t)}{dt}=v(t), $$ and $$ \frac{dv(t)}{dt}=-\omega_0^2x(t). $$ We are now going to solve these equations using the Runge-Kutta method to fourth order discussed previously.

Simple Example, Block tied to a Wall

Before proceeding however, it is important to note that in addition to the exact solution, we have at least two further tests which can be used to check our solution.

Since functions like $ cos $ are periodic with a period $ 2\pi $, then the solution $ x(t) $ has also to be periodic. This means that $$ x(t+T)=x(t), $$ with $ T $ the period defined as $$ T=\frac{2\pi}{\omega_0}=\frac{2\pi}{\sqrt{k/m}}. $$ Observe that $ T $ depends only on $ k/m $ and not on the amplitude of the solution.

Simple Example, Block tied to a Wall

In addition to the periodicity test, the total energy has also to be conserved.

Suppose we choose the initial conditions $$ x(t=0)=1\hspace{0.1cm} \mathrm{m}\hspace{1cm} v(t=0)=0\hspace{0.1cm}\mathrm{m/s}, $$ meaning that block is at rest at $ t=0 $ but with a potential energy $$ E_0=\frac{1}{2}kx(t=0)^2=\frac{1}{2}k. $$ The total energy at any time $ t $ has however to be conserved, meaning that our solution has to fulfil the condition $$ E_0=\frac{1}{2}kx(t)^2+\frac{1}{2}mv(t)^2. $$

Simple Example, Block tied to a Wall

An algorithm which implements these equations is included below.

Choose the initial position and speed, with the most common choice $ v(t=0)=0 $ and some fixed value for the position.
Choose the method you wish to employ in solving the problem.
Subdivide the time interval $ [t_i,t_f] $ into a grid with step size

$$ h=\frac{t_f-t_i}{N}, $$ where $ N $ is the number of mesh points.

Calculate now the total energy given by

$$ E_0=\frac{1}{2}kx(t=0)^2=\frac{1}{2}k. $$

The Runge-Kutta method is used to obtain $ x_{i+1} $ and $ v_{i+1} $ starting from the previous values $ x_i $ and $ v_i $.
When we have computed $ x(v)_{i+1} $ we upgrade $ t_{i+1}=t_i+h $.
This iterative process continues till we reach the maximum time $ t_f $.
The results are checked against the exact solution. Furthermore, one has to check the stability of the numerical solution against the chosen number of mesh points $ N $.

Simple Example, Block tied to a Wall

/*    This program solves Newton's equation for a block
      sliding on a horizontal frictionless surface. The block
      is tied  to a wall with a spring, and Newton's equation
      takes the form
           m d^2x/dt^2 =-kx
      with k the spring tension and m the mass of the block.
      The angular frequency is omega^2 = k/m and we set it equal
      1 in this example program. 

      Newton's equation is rewritten as two coupled differential
      equations, one for the position x  and one for the velocity v
           dx/dt = v    and
           dv/dt = -x   when we set k/m=1

      We use therefore a two-dimensional array to represent x and v
      as functions of t
      y[0] == x
      y[1] == v
      dy[0]/dt = v
      dy[1]/dt = -x

      The derivatives are calculated by the user defined function 
      derivatives.

      The user has to specify the initial velocity (usually v_0=0)
      the number of steps and the initial position. In the programme
      below we fix the time interval [a,b] to [0,2*pi].

*/ 
#include <cmath>
#include <iostream>
#include <fstream>
#include <iomanip>
//#include "lib.h"
using namespace  std;
// output file as global variable
ofstream ofile;
// function declarations
void derivatives(double, double *, double *);
void initialise ( double&, double&, int&);
void output( double, double *, double);
void runge_kutta_4(double *, double *, int, double, double, 
                   double *, void (*)(double, double *, double *));

int main(int argc, char* argv[])
{
//  declarations of variables
  double *y, *dydt, *yout, t, h, tmax, E0;
  double initial_x, initial_v;
  int i, number_of_steps, n;
  char *outfilename;
  // Read in output file, abort if there are too few command-line arguments
  if( argc <= 1 ){
    cout << "Bad Usage: " << argv[0] <<
      " read also output file on same line" << endl;
    //    exit(1);
  }
  else{
    outfilename=argv[1];
  }
  ofile.open(outfilename);
  //  this is the number of differential equations  
  n = 2;     
  //  allocate space in memory for the arrays containing the derivatives 
  dydt = new double[n];
  y = new double[n];
  yout = new double[n];
  // read in the initial position, velocity and number of steps 
  initialise (initial_x, initial_v, number_of_steps);
  //  setting initial values, step size and max time tmax  
  h = 4.*acos(-1.)/( (double) number_of_steps);   // the step size     
  tmax = h*number_of_steps;               // the final time    
  y[0] = initial_x;                       // initial position  
  y[1] = initial_v;                       // initial velocity  
  t=0.;                                   // initial time      
  E0 = 0.5*y[0]*y[0]+0.5*y[1]*y[1];       // the initial total energy
  // now we start solving the differential equations using the RK4 method 
  while (t <= tmax){
    derivatives(t, y, dydt);   // initial derivatives              
    runge_kutta_4(y, dydt, n, t, h, yout, derivatives); 
    for (i = 0; i < n; i++) {
      y[i] = yout[i];  
    }
    t += h;
    output(t, y, E0);   // write to file 
  }
  delete [] y; delete [] dydt; delete [] yout;
  ofile.close();  // close output file
  return 0;
}   //  End of main function 

//     Read in from screen the number of steps,
//     initial position and initial speed 
void initialise (double& initial_x, double& initial_v, int& number_of_steps)
{
 cout << "Initial position = ";
 cin >> initial_x;
 cout << "Initial speed = ";
 cin >> initial_v;
 cout << "Number of steps = ";
 cin >> number_of_steps;
}  // end of function initialise  

//   this function sets up the derivatives for this special case  
void derivatives(double t, double *y, double *dydt)
{
  dydt[0]=y[1];    // derivative of x 
  dydt[1]=-y[0]; // derivative of v 
} // end of function derivatives  

//    function to write out the final results
void output(double t, double *y, double E0)
{
  ofile << setiosflags(ios::showpoint | ios::uppercase);
  ofile << setw(15) << setprecision(8) << t;
  ofile << setw(15) << setprecision(8) << y[0];
  ofile << setw(15) << setprecision(8) << y[1];
  ofile << setw(15) << setprecision(8) << cos(t);
  ofile << setw(15) << setprecision(8) << 
    0.5*y[0]*y[0]+0.5*y[1]*y[1]-E0 << endl;
}  // end of function output

/*   This function upgrades a function y (input as a pointer)
     and returns the result yout, also as a pointer. Note that
     these variables are declared as arrays.  It also receives as
     input the starting value for the derivatives in the pointer
     dydx. It receives also the variable n which represents the 
     number of differential equations, the step size h and 
     the initial value of x. It receives also the name of the
     function *derivs where the given derivative is computed
*/
void runge_kutta_4(double *y, double *dydx, int n, double x, double h, 
                   double *yout, void (*derivs)(double, double *, double *))
{
  int i;
  double      xh,hh,h6; 
  double *dym, *dyt, *yt;
  //   allocate space for local vectors   
  dym = new double [n];
  dyt =  new double [n];
  yt =  new double [n];
  hh = h*0.5;
  h6 = h/6.;
  xh = x+hh;
  for (i = 0; i < n; i++) {
    yt[i] = y[i]+hh*dydx[i];
  }
  (*derivs)(xh,yt,dyt);     // computation of k2, eq. 3.60   
  for (i = 0; i < n; i++) {
    yt[i] = y[i]+hh*dyt[i];
  }
  (*derivs)(xh,yt,dym); //  computation of k3, eq. 3.61   
  for (i=0; i < n; i++) {
    yt[i] = y[i]+h*dym[i];
    dym[i] += dyt[i];
  }
  (*derivs)(x+h,yt,dyt);    // computation of k4, eq. 3.62   
  //      now we upgrade y in the array yout  
  for (i = 0; i < n; i++){
    yout[i] = y[i]+h6*(dydx[i]+dyt[i]+2.0*dym[i]);
  }
  delete []dym;
  delete [] dyt;
  delete [] yt;
}       //  end of function Runge-kutta 4

The classical pendulum

The angular equation of motion of the pendulum is given by Newton's equation and with no external force it reads $$ \begin{equation} ml\frac{d^2\theta}{dt^2}+mgsin(\theta)=0, \end{equation} $$ with an angular velocity and acceleration given by $$ \begin{equation} v=l\frac{d\theta}{dt}, \end{equation} $$ and $$ \begin{equation} a=l\frac{d^2\theta}{dt^2}. \end{equation} $$

More on the Pendulum

We do however expect that the motion will gradually come to an end due a viscous drag torque acting on the pendulum. In the presence of the drag, the above equation becomes $$ \begin{equation} ml\frac{d^2\theta}{dt^2}+\nu\frac{d\theta}{dt} +mgsin(\theta)=0, \label{eq:pend1} \end{equation} $$ where $ \nu $ is now a positive constant parameterizing the viscosity of the medium in question. In order to maintain the motion against viscosity, it is necessary to add some external driving force. We choose here a periodic driving force. The last equation becomes then $$ \begin{equation} ml\frac{d^2\theta}{dt^2}+\nu\frac{d\theta}{dt} +mgsin(\theta)=Asin(\omega t), \label{eq:pend2} \end{equation} $$ with $ A $ and $ \omega $ two constants representing the amplitude and the angular frequency respectively. The latter is called the driving frequency.

More on the Pendulum

We define $$ \omega_0=\sqrt{g/l}, $$ the so-called natural frequency and the new dimensionless quantities $$ \hat{t}=\omega_0t, $$ with the dimensionless driving frequency $$ \hat{\omega}=\frac{\omega}{\omega_0}, $$ and introducing the quantity $ Q $, called the quality factor, $$ Q=\frac{mg}{\omega_0\nu}, $$ and the dimensionless amplitude $$ \hat{A}=\frac{A}{mg} $$

More on the Pendulum

We have $$ \frac{d^2\theta}{d\hat{t}^2}+\frac{1}{Q}\frac{d\theta}{d\hat{t}} +sin(\theta)=\hat{A}cos(\hat{\omega}\hat{t}). $$ This equation can in turn be recast in terms of two coupled first-order differential equations as follows $$ \frac{d\theta}{d\hat{t}}=\hat{v}, $$ and $$ \frac{d\hat{v}}{d\hat{t}}=-\frac{\hat{v}}{Q}-sin(\theta)+\hat{A}cos(\hat{\omega}\hat{t}). $$ These are the equations to be solved. The factor $ Q $ represents the number of oscillations of the undriven system that must occur before its energy is significantly reduced due to the viscous drag. The amplitude $ \hat{A} $ is measured in units of the maximum possible gravitational torque while $ \hat{\omega} $ is the angular frequency of the external torque measured in units of the pendulum's natural frequency.

Classes for ODE methods

It can be very useful to make a Class which contains all possible methods discussed. In Fortran we can use the MODULE keyword in order to can methods and keep the variables private and hidden from other parts of our code. This allows for a generalization which can be used to tackle other ODEs as well. In program2.cpp of chapter 8 we have canned the following methods

void euler();
void euler_cromer();
void midpoint();
void euler_richardson();
void half_step();
void rk2(); //runge-kutta-second-order
void rk4_step(double,double*,double*,double); // we need it in function rk4() and asc()
void rk4(); //runge-kutta-fourth-order
void asc(); //runge-kutta-fourth-order with adaptive stepsize control

Classes for ODE methods

#include <stdio.h>
#include <iostream>
#include <cmath>
#include <fstream>
#include <iomanip>
using namespace  std;
/*

Different methods for solving ODEs are presented
We are solving the following eqation:

m*l*(phi)'' + reib*(phi)' + m*g*sin(phi) = A*cos(omega*t)

If you want to solve similar equations with other values you have to
rewrite the methods 'derivatives' and 'initialise' and change the variables in the private
part of the class Pendel

At first we rewrite the equation using the following definitions:

omega_0 = sqrt(g*l)
t_roof = omega_0*t
omega_roof = omega/omega_0
Q = (m*g)/(omega_0*reib)
A_roof = A/(m*g)

and we get a dimensionless equation

(phi)'' + 1/Q*(phi)' + sin(phi) = A_roof*cos(omega_roof*t_roof)

This equation can be written as two equations of first order:

(phi)' = v
(v)' = -v/Q - sin(phi) +A_roof*cos(omega_roof*t_roof)

All numerical methods are applied to the last two equations.
*/

class pendelum
 {
 private:
   double Q, A_roof, omega_0, omega_roof,g; //
   double y[2];          //for the initial-values of phi and v
   int n;                // how many steps
   double delta_t,delta_t_roof;

 public:
   void derivatives(double,double*,double*);
   void initialise();
   void euler();
   void euler_cromer();
   void midpoint();
   void euler_richardson();
   void half_step();
   void rk2(); //runge-kutta-second-order
   void rk4_step(double,double*,double*,double); // we need it in function rk4() and asc()
   void rk4(); //runge-kutta-fourth-order
   void asc(); //runge-kutta-fourth-order with adaptive stepsize control
 };

void pendelum::derivatives(double t, double* in, double* out)
{ /* Here we are calculating the derivatives at (dimensionless) time t
     'in' are the values of phi and v, which are used for the calculation
     The results are given to 'out' */
  
  out[0]=in[1];             //out[0] = (phi)'  = v
  if(Q)
    out[1]=-in[1]/((double)Q)-sin(in[0])+A_roof*cos(omega_roof*t);  //out[1] = (phi)''
  else
    out[1]=-sin(in[0])+A_roof*cos(omega_roof*t);  //out[1] = (phi)''
}


void pendelum::initialise()
{
  double m,l,omega,A,viscosity,phi_0,v_0,t_end;
  cout<<"Solving the differential eqation of the pendulum!\n";
  cout<<"We have a pendulum with mass m, length l. Then we have a periodic force with amplitude A and omega\n";
  cout<<"Furthermore there is a viscous drag coefficient.\n";
  cout<<"The initial conditions at t=0 are phi_0 and v_0\n";
  cout<<"Mass m: ";
  cin>>m;
  cout<<"length l: ";
  cin>>l;
  cout<<"omega of the force: ";
  cin>>omega;
  cout<<"amplitude of the force: ";
  cin>>A;
  cout<<"The value of the viscous drag constant (viscosity): ";
  cin>>viscosity;
  cout<<"phi_0: ";
  cin>>y[0];
  cout<<"v_0: ";
  cin>>y[1];
  cout<<"Number of time steps or integration steps:";
  cin>>n;
  cout<<"Final time steps as multiplum of pi:";
  cin>>t_end;
  t_end *= acos(-1.);
  g=9.81;
  // We need the following values:
  omega_0=sqrt(g/((double)l));      // omega of the pendulum
  if (viscosity)  Q= m*g/((double)omega_0*viscosity);
  else Q=0; //calculating Q
  A_roof=A/((double)m*g);
  omega_roof=omega/((double)omega_0);
  delta_t_roof=omega_0*t_end/((double)n);    //delta_t without dimension
  delta_t=t_end/((double)n);
}

void pendelum::euler()
{ //using simple euler-method
  int i;
  double yout[2],y_h[2];
  double t_h;

  y_h[0]=y[0];
  y_h[1]=y[1];
  t_h=0;
  ofstream fout("euler.out");
  fout.setf(ios::scientific);
  fout.precision(20);
  for(i=1;i<=n;i++){
    derivatives(t_h,y_h,yout);
    yout[1]=y_h[1]+yout[1]*delta_t_roof;
    yout[0]=y_h[0]+yout[0]*delta_t_roof;
    // Calculation with dimensionless values    
    fout<<i*delta_t<<"\t\t"<<yout[0]<<"\t\t"<<yout[1]<<"\n";
    t_h+=delta_t_roof;
    y_h[1]=yout[1];
    y_h[0]=yout[0];
  }
  fout.close;
}

void pendelum::euler_cromer()
{
  int i;
  double t_h;
  double yout[2],y_h[2];

  t_h=0;
  y_h[0]=y[0];  //phi
  y_h[1]=y[1];  //v
  ofstream fout("ec.out");
  fout.setf(ios::scientific);
  fout.precision(20);
  for(i=1; i<=n; i++){
    derivatives(t_h,y_h,yout);
    yout[1]=y_h[1]+yout[1]*delta_t_roof;
    yout[0]=y_h[0]+yout[1]*delta_t_roof;
    // The new calculated value of v is used for calculating phi 
    fout<<i*delta_t<<"\t\t"<<yout[0]<<"\t\t"<<yout[1]<<"\n";
    t_h+=delta_t_roof;
    y_h[0]=yout[0];
    y_h[1]=yout[1];
  }
  fout.close;
}

void pendelum::midpoint()
{
  int i;
  double t_h;
  double yout[2],y_h[2];
  
  t_h=0;
  y_h[0]=y[0];  //phi
  y_h[1]=y[1];  //v
  ofstream fout("midpoint.out");
  fout.setf(ios::scientific);
  fout.precision(20);
  for(i=1; i<=n; i++){
    derivatives(t_h,y_h,yout);
    yout[1]=y_h[1]+yout[1]*delta_t_roof;
    yout[0]=y_h[0]+0.5*(yout[1]+y_h[1])*delta_t_roof;
    fout<<i*delta_t<<"\t\t"<<yout[0]<<"\t\t"<<yout[1]<<"\n";
    t_h+=delta_t_roof;
    y_h[0]=yout[0];
    y_h[1]=yout[1];
  }
  fout.close;
}


void pendelum::euler_richardson()
{
  int i;
  double t_h,t_m;
  double yout[2],y_h[2],y_m[2];

  t_h=0;
  y_h[0]=y[0];  //phi
  y_h[1]=y[1];  //v
  ofstream fout("er.out");
  fout.setf(ios::scientific);
  fout.precision(20);
  for(i=1; i<=n; i++){
    derivatives(t_h,y_h,yout);
    y_m[1]=y_h[1]+0.5*yout[1]*delta_t_roof;
    y_m[0]=y_h[0]+0.5*y_h[1]*delta_t_roof;
    t_m=t_h+0.5*delta_t_roof;
    derivatives(t_m,y_m,yout);
    yout[1]=y_h[1]+yout[1]*delta_t_roof;
    yout[0]=y_h[0]+y_m[1]*delta_t_roof;
    fout<<i*delta_t<<"\t\t"<<yout[0]<<"\t\t"<<yout[1]<<"\n";
    t_h+=delta_t_roof;
    y_h[0]=yout[0];
    y_h[1]=yout[1];
  }
  fout.close;
}

void pendelum::half_step()
{
  /*We are using the half_step_algorith.
    The algorithm is not self-starting, so we calculate
    v_1/2 by using the Euler algorithm. */

  int i;
  double t_h;
  double yout[2],y_h[2];

  t_h=0;
  y_h[0]=y[0];  //phi
  y_h[1]=y[1];  //v
  ofstream fout("half_step.out");
  fout.setf(ios::scientific);
  fout.precision(20);
  /*At first we have to calculate v_1/2
    For this we use Euler's method:
    v_`1/2 = v_0 + 1/2*a_0*delta_t_roof
    For calculating a_0 we have to start derivatives
  */
  derivatives(t_h,y_h,yout);
  yout[1]=y_h[1]+0.5*yout[1]*delta_t_roof;
  yout[0]=y_h[0]+yout[1]*delta_t_roof;
  fout<<delta_t<<"\t\t"<<yout[0]<<"\t\t"<<yout[1]<<"\n";
  y_h[0]=yout[0];
  y_h[1]=yout[1];
  for(i=2; i<=n; i++){
    derivatives(t_h,y_h,yout);
    yout[1]=y_h[1]+yout[1]*delta_t_roof;
    yout[0]=y_h[0]+yout[1]*delta_t_roof;
    fout<<i*delta_t<<"\t\t"<<yout[0]<<"\t\t"<<yout[1]<<"\n";
    t_h+=delta_t_roof;
    y_h[0]=yout[0];
    y_h[1]=yout[1];
  }
  fout.close;
}

void pendelum::rk2()
{
  /*We are using the second-order-Runge-Kutta-algorithm
    We have to calculate the parameters k1 and k2 for v and phi,
    so we use to arrays k1[2] and k2[2] for this
    k1[0], k2[0] are the parameters for phi,
    k1[1], k2[1] are the parameters for v
  */

  int i;
  double t_h;
  double yout[2],y_h[2],k1[2],k2[2],y_k[2];
  
  t_h=0;
  y_h[0]=y[0];  //phi
  y_h[1]=y[1];  //v
  ofstream fout("rk2.out");
  fout.setf(ios::scientific);
  fout.precision(20);
  for(i=1; i<=n; i++){
    /*Calculation of k1 */
    derivatives(t_h,y_h,yout);
    k1[1]=yout[1]*delta_t_roof;
    k1[0]=yout[0]*delta_t_roof;
    y_k[0]=y_h[0]+k1[0]*0.5;
    y_k[1]=y_h[1]+k2[1]*0.5;
    /*Calculation of k2 */
    derivatives(t_h+delta_t_roof*0.5,y_k,yout);
    k2[1]=yout[1]*delta_t_roof;
    k2[0]=yout[0]*delta_t_roof;
    yout[1]=y_h[1]+k2[1];
    yout[0]=y_h[0]+k2[0];
    fout<<i*delta_t<<"\t\t"<<yout[0]<<"\t\t"<<yout[1]<<"\n";
    t_h+=delta_t_roof;
    y_h[0]=yout[0];
    y_h[1]=yout[1];
  }
  fout.close;
}

void pendelum::rk4_step(double t,double *yin,double *yout,double delta_t)
{
  /*
    The function calculates one step of fourth-order-runge-kutta-method
    We will need it for the normal fourth-order-Runge-Kutta-method and
    for RK-method with adaptive stepsize control

    The function calculates the value of y(t + delta_t) using fourth-order-RK-method
    Input: time t and the stepsize delta_t, yin (values of phi and v at time t)
    Output: yout (values of phi and v at time t+delta_t)
    
  */
  double k1[2],k2[2],k3[2],k4[2],y_k[2];
  // Calculation of k1 
  derivatives(t,yin,yout);
  k1[1]=yout[1]*delta_t;
  k1[0]=yout[0]*delta_t;
  y_k[0]=yin[0]+k1[0]*0.5;
  y_k[1]=yin[1]+k1[1]*0.5;
  /*Calculation of k2 */
  derivatives(t+delta_t*0.5,y_k,yout);
  k2[1]=yout[1]*delta_t;
  k2[0]=yout[0]*delta_t;
  y_k[0]=yin[0]+k2[0]*0.5;
  y_k[1]=yin[1]+k2[1]*0.5;
  /* Calculation of k3 */
  derivatives(t+delta_t*0.5,y_k,yout);
  k3[1]=yout[1]*delta_t;
  k3[0]=yout[0]*delta_t;
  y_k[0]=yin[0]+k3[0];
  y_k[1]=yin[1]+k3[1];
  /*Calculation of k4 */
  derivatives(t+delta_t,y_k,yout);
  k4[1]=yout[1]*delta_t;
  k4[0]=yout[0]*delta_t;
  /*Calculation of new values of phi and v */
  yout[0]=yin[0]+1.0/6.0*(k1[0]+2*k2[0]+2*k3[0]+k4[0]);
  yout[1]=yin[1]+1.0/6.0*(k1[1]+2*k2[1]+2*k3[1]+k4[1]);
}

void pendelum::rk4()
{
  /*We are using the fourth-order-Runge-Kutta-algorithm
    We have to calculate the parameters k1, k2, k3, k4 for v and phi,
    so we use to arrays k1[2] and k2[2] for this
    k1[0], k2[0] are the parameters for phi,
    k1[1], k2[1] are the parameters for v
  */

  int i;
  double t_h;
  double yout[2],y_h[2]; //k1[2],k2[2],k3[2],k4[2],y_k[2];
  
  t_h=0;
  y_h[0]=y[0];  //phi
  y_h[1]=y[1];  //v
  ofstream fout("rk4.out");
  fout.setf(ios::scientific);
  fout.precision(20);
  for(i=1; i<=n; i++){
    rk4_step(t_h,y_h,yout,delta_t_roof);
    fout<<i*delta_t<<"\t\t"<<yout[0]<<"\t\t"<<yout[1]<<"\n";
    t_h+=delta_t_roof;
    y_h[0]=yout[0];
    y_h[1]=yout[1];
  }
  fout.close;
}



void pendelum::asc()
{
  /*
    We are using the Runge-Kutta-algorithm with adaptive stepsize control
    according to "Numerical Recipes in C", S. 574 ff.
    
    At first we calculate y(x+h) using rk4-method  => y1
    Then we calculate y(x+h) using two times rk4-method at x+h/2 and x+h  => y2
    
    The difference between these values is called "delta" If it is smaller than a given value,
    we calculate y(x+h) by  y2 + (delta)/15 (page 575, Numerical R.)
    
    If delta is not smaller than ... we calculate a new stepsize using
    h_new=(Safety)*h_old*(.../delta)^(0.25) where "Safety" is constant (page 577 N.R.)
    and start again with calculating y(x+h)...
   */
  int i;
  double t_h,h_alt,h_neu,hh,errmax;
  double yout[2],y_h[2],y_m[2],y1[2],y2[2], delta[2], yscal[2];
  
  const double eps=1.0e-6;
  const double safety=0.9;
  const double errcon=6.0e-4;
  const double tiny=1.0e-30;

  t_h=0;
  y_h[0]=y[0];  //phi
  y_h[1]=y[1];  //v
  h_neu=delta_t_roof;
  ofstream fout("asc.out");
  fout.setf(ios::scientific);
  fout.precision(20);
  for(i=0;i<=n;i++){
    /* The error is scaled against yscal
       We use a yscal of the form yscal = fabs(y[i]) + fabs(h*derivatives[i])
       (N.R. page 567)
    */
    derivatives(t_h,y_h,yout);
    yscal[0]=fabs(y[0])+fabs(h_neu*yout[0])+tiny;
    yscal[1]=fabs(y[1])+fabs(h_neu*yout[1])+tiny;
    /* the do-while-loop is used until the */
    do{
      /* Calculating y2 by two half steps */
      h_alt=h_neu;
      hh=h_alt*0.5;
      rk4_step(t_h, y_h, y_m, hh);
      rk4_step(t_h+hh,y_m,y2,hh);
      /* Calculating y1 by one normal step */
      rk4_step(t_h,y_h,y1,h_alt);
      /* Now we have two values for phi and v at the time t_h + h  in y2 and y1
	 We can now calculate the delta for phi and v
      */
      delta[0]=fabs(y1[0]-y2[0]);
      delta[1]=fabs(y1[1]-y2[1]);
      errmax=(delta[0]/yscal[0] > delta[1]/yscal[1] ? delta[0]/yscal[0] : delta[1]/yscal[1]);
      
      /*We scale delta against the constant yscal
	Then we take the biggest one and call it errmax */
      errmax=(double)errmax/eps;
      /*We divide errmax by eps and have only   */
      h_neu=safety*h_alt*exp(-0.25*log(errmax));
    }while(errmax>1.0);
    /*Now we are outside the do-while-loop and have a delta which is small enough
      So we can calculate the new values of phi and v
    */
    yout[0]=y_h[0]+delta[0]/15.0;
    yout[1]=y_h[1]+delta[1]/15.0;
    fout<<(double)(t_h+h_alt)/omega_0<<"\t\t"<<yout[0]<<"\t\t"<<yout[1]<<"\n";
    // Calculating of the new stepsize
    h_neu=(errmax > errcon ? safety*h_alt*exp(-0.20*log(errmax)) : 4.0*h_alt);
    y_h[0]=yout[0];
    y_h[1]=yout[1];
    t_h+=h_neu;
  }
}


int main()
{
  pendelum testcase;
  testcase.initialise();
  //  testcase.euler();
  //testcase.euler_cromer();
  //testcase.midpoint();
  //testcase.euler_richardson();
  //testcase.half_step();
  //testcase.rk2();
  testcase.rk4();
  return 0;
}  // end of main function

Classes for ODE methods

In Fortran we would use

MODULE pendulum
   USE CONSTANTS
   IMPLICIT NONE
   REAL(DP), PRIVATE :: Q, A_roof, omega_0, omega_roof,g
   REAL(DP), PRIVATE :: y(2)         ! for the initial-values of phi and v
   INTEGER, PRIVATE ::  n               ! how many steps
   REAL(DP), PRIVATE :: delta_t,delta_t_roof

   CONTAINS
    SUBROUTINE derivatives(..)
    SUBROUTINE initialise(..)
    SUBROUTINE euler(..)
    SUBROUTINE euler_cromer(..)
    SUBROUTINE midpoint(..)
    etc

END MODULE pendulum

Adaptive methods

In case the function to integrate varies slowly or fast in different integration domains, adaptive methods are normally used. One strategy is always to decrease the step size. As we have seen earlier, this leads to more CPU cycles and may lead to loss or numerical precision. An alternative is to use higher-order RK methods for example. However, this leads again to more cycles, furthermore, there is no guarantee that higher-order leads to an improved error.

Adaptive methods

Assume the exact result is $ \tilde{x} $ and that we are using an RKM method. Suppose we run two calculations, one with $ h $ (called $ x_1 $) and one with $ h/2 $ (called $ x_2 $). Then $$ \tilde{x}=x_1+Ch^{M+1}+O(h^{M+2}), $$ and $$ \tilde{x}=x_2+2C(h/2)^{M+1}+O(h^{M+2}), $$ with $ C $ a constant. Note that we calculate two halves in the last equation. We get then $$ |x_1-x_2| = Ch^{M+1}(1-\frac{1}{2^M}). $$ yielding $$ C=\frac{|x_1-x_2|}{(1-2^{-M})h^{M+1}}. $$ We rewrite $$ \tilde{x}=x_2+\epsilon+O((h)^{M+2}), $$ with $$ \epsilon = \frac{|x_1-x_2|}{2^M-1}. $$

Adaptive methods

With RK4 the expressions become $$ \tilde{x}=x_2+\epsilon+O((h)^{6}), $$ with $$ \epsilon = \frac{|x_1-x_2|}{15}. $$ The estimate is one order higher than the original RK4. But this method is normally rather inefficient since it requires a lot of computations. We solve typically the equation three times at each time step. However, we can compare the estimate $ \epsilon $ with some by us given accuracy $ \xi $. We can then ask the question: what is, with a given $ x_j $ and $ t_j $, the largest possible step size $ \tilde{h} $ that leads to a truncation error below $ \xi $? We want $$ C\tilde{h} \le \xi, $$ which leads to $$ \left(\frac{\tilde{h}}{h}\right)^{M+1}\frac{|x_1-x_2|}{(1-2^{-M})}\le \xi, $$ meaning that $$ \tilde{h}=h\left(\frac{\xi}{\epsilon}\right)^{1+1/M}. $$

Adaptive methods

With $$ \tilde{h}=h\left(\frac{\xi}{\epsilon}\right)^{1+1/M}. $$ we can design the following algorithm:

If the two answers are close, keep the approximation to $ h $.
If $ \epsilon > \xi $ we need to decrease the step size in the next time step.
If $ \epsilon < \xi $ we need to increase the step size in the next time step.

A much used algorithm is the so-called RKF45 which uses a combination of fourth and fifth order RK methods.

Adaptive methods, RKF45

At each step, two different approximations for the solution are made and compared. If the two answers are in close agreement, the approximation is accepted. If the two answers do not agree to a specified accuracy, the step size is reduced. If the answers agree to more significant digits than required, the step size is increased. Each step requires the use of the following six values: $$ k_1 = h f (t_k , y_k ), $$ $$ k_2 = h f (t_k + \frac{1}{4}h, y_k + \frac{1}{4}k_1) , $$ $$ k_3 = h f (t_k + \frac{3}{8}h, y_k + \frac{3}{32}k_1 + \frac{9}{32}k_2) , $$ $$ k_4 = h f (t_k + \frac{12}{13}h, y_k + \frac{1932}{2197}k_1 + \frac{7200}{2197}k_2+\frac{7296}{2197}k_3), $$ $$ k_5 = h f (t_k + h, y_k + \frac{439}{216}k_1 -8k_2+ \frac{3680}{513}k_3+\frac{845}{4104}k_4), $$ $$ k_6 = h f (t_k + \frac{1}{2}h, y_k - \frac{8}{27}k_1 + 2k_2-\frac{3544}{2565}k_2+\frac{1859}{4104}k_4-+\frac{11}{40}k_5). $$

Adaptive methods, RKF45

An approximation to the solution of the ODE is made using a Runge-Kutta method of order 4: $$ y_{k+1} = y_k + \frac{25}{216}k_1+\frac{1408}{2565}k_3 +\frac{2197}{4101}k_4-\frac{1}{5}k_5, $$ where the four function values $ k_1 $ , $ k_3 $ , $ k_4 $ , and $ k_5 $ are used. Notice that $ k_2 $ is not used here. A better value for the solution is determined using a Runge-Kutta method of order 5: $$ z_{k+1} = y_k + \frac{16}{135}k_1+\frac{6656}{12825}k_3 +\frac{28561}{56430}k_4-\frac{9}{50}k_5+\frac{2}{55}k_6. $$ The optimal time step $ \alpha h $ is then determined by $$ \alpha = \left( \frac{\xi h}{2|z_{k+1}-y_{k+1}|}\right)^{1/4}, $$ with $ \xi $ our defined tolerance.

Project 3: An object oriented example program for project 3

See https://github.com/htihle/Keplerproblem. To be discussed during the lab sessions in week 40

\( h \)	\( e^{h}+e^{-h} \)	\( e^{h}+e^{-h}-2 \)
\( 10^{-1} \)	2.0100083361116070	\( 1.0008336111607230\times 10^{-2} \)
\( 10^{-2} \)	2.0001000008333358	\( 1.0000083333605581\times 10^{-4} \)
\( 10^{-3} \)	2.0000010000000836	\( 1.0000000834065048\times 10^{-6} \)
\( 10^{-5} \)	2.0000000099999999	\( 1.0000000050247593\times 10^{-8} \)
\( 10^{-5} \)	2.0000000001000000	\( 9.9999897251734637\times 10^{-11} \)
\( 10^{-6} \)	2.0000000000010001	\( 9.9997787827987850\times 10^{-13} \)
\( 10^{-7} \)	2.0000000000000098	\( 9.9920072216264089\times 10^{-15} \)
\( 10^{-8} \)	2.0000000000000000	\( 0.0000000000000000\times 10^{0} \)
\( 10^{-9} \)	2.0000000000000000	\( 1.1102230246251565\times 10^{-16} \)
\( 10^{-10} \)	2.0000000000000000	\( 0.0000000000000000\times 10^{0} \)

Relations	Name	matrix elements
\( A = A^{T} \)	symmetric	\( a_{ij} = a_{ji} \)
\( A = \left (A^{T} \right )^{-1} \)	real orthogonal	\( \sum_k a_{ik} a_{jk} = \sum_k a_{ki} a_{kj} = \delta_{ij} \)
\( A = A^{*} \)	real matrix	\( a_{ij} = a_{ij}^{*} \)
\( A = A^{\dagger} \)	hermitian	\( a_{ij} = a_{ji}^{*} \)
\( A = \left (A^{\dagger} \right )^{-1} \)	unitary	\( \sum_k a_{ik} a_{jk}^{} = \sum_k a_{ki}^{} a_{kj} = \delta_{ij} \)

417/2=208	remainder 1	coefficient of \( 2^{0} \) is 1
208/2=104	remainder 0	coefficient of \( 2^{1} \) is 0
104/2=52	remainder 0	coefficient of \( 2^{2} \) is 0
52/2=26	remainder 0	coefficient of \( 2^{3} \) is 0
26/2=13	remainder 1	coefficient of \( 2^{4} \) is 0
13/2= 6	remainder 1	coefficient of \( 2^{5} \) is 1
6/2= 3	remainder 0	coefficient of \( 2^{6} \) is 0
3/2= 1	remainder 1	coefficient of \( 2^{7} \) is 1
1/2= 0	remainder 1	coefficient of \( 2^{8} \) is 1

\( x \)	\( \exp{(-x)} \)	Series	Number of terms in series
0.0	0.100000E+01	0.100000E+01	1
10.0	0.453999E-04	0.453999E-04	44
20.0	0.206115E-08	0.487460E-08	72
30.0	0.935762E-13	-0.342134E-04	100
40.0	0.424835E-17	-0.221033E+01	127
50.0	0.192875E-21	-0.833851E+05	155
60.0	0.875651E-26	-0.850381E+09	171
70.0	0.397545E-30	NaN	171
80.0	0.180485E-34	NaN	171
90.0	0.819401E-39	NaN	171
100.0	0.372008E-43	NaN	171

\( x \)	\( h=0.01 \)	\( h=0.001 \)	\( h=0.0001 \)	\( h=0.0000001 \)	Exact
0.0	1.000008	1.000000	1.000000	1.010303	1.000000
1.0	2.718304	2.718282	2.718282	2.753353	2.718282
2.0	7.389118	7.389057	7.389056	7.283063	7.389056
3.0	20.085704	20.085539	20.085537	20.250467	20.085537
4.0	54.598605	54.598155	54.598151	54.711789	54.598150
5.0	148.414396	148.413172	148.413161	150.635056	148.413159

Year	\( n \)
1950	\( n=20 \)	(Wilkinson)
1965	\( n=200 \)	(Forsythe et al.)
1980	\( n=2000 \)	Linpack
1995	\( n=20000 \)	Lapack
2012	\( n\sim 10^5 \)	Lapack

Matrix	\( {\bf A}{\bf x}={\bf b} \)	\( {\bf A}{\bf x}=\lambda{\bf x} \)
\( {\bf A}={\bf A}^* \)	Conjugate gradient	Lanczos
\( {\bf A}\ne {\bf A}^* \)	GMRES etc	Arnoldi

Slides from FYS3150/4150 Lectures

Sep 22, 2014

Week 34

Overview of week 34

Lectures and ComputerLab

Course Format

ComputerLab

Topics covered in this course

Syllabus FYS3150

Syllabus FYS3150

Syllabus FYS3150

Syllabus FYS3150

Syllabus FYS3150

Syllabus FYS3150

Overarching aims of this course

And, there is nothing like a code which gives correct results!!

Other courses in Computational Science at UiO

Extremely useful tools, strongly recommended

A structured programming approach

A structured programming approach

Getting Started

Makefiles and simple scripts

Hello world

Hello World

Hello World

Hello World

C++ Hello World

Brief summary

Brief summary

Serious problems and representation of numbers

Limits, you must declare variables

From decimal to binary representation

From decimal to binary representation

From decimal to binary representation

Integer Numbers

Loss of Precision

Loss of Precision

Loss of Precision

Loss of Precision

Loss of Precision

Loss of Precision

Loss of Precision

Loss of numerical precision

Loss of numerical precision

Loss of numerical precision

Loss of precision can cuae serious problems

Loss of precision, real numbers

Roundoff errors

More on loss of precision

A problematic case

Program to compute \( \exp{(-x)} \)

Program to compute \( \exp{(-x)} \)

Program to compute \( \exp{(-x)} \)

Results \( \exp{(-x)} \)

Program to compute \( \exp{(-x)} \)

Program to compute \( \exp{(-x)} \)

Results \( \exp{(-x)} \)

Most used formula for derivatives

Error Analysis

Error Analysis

Error Analysis

Error Analysis

Error Analysis

Error Analysis

Week 35

Overview of week 35

Technical Matter in C/C++: Pointers

Technical Matter in C/C++: Pointer example I

Dissection: Pointer example I

Pointer example II

Dissection: Pointer example II

Output of Pointer example II

File handling; C-way

File handling; C way cont.

File handling, C++-way

File handling, C++-way

File handling, C++-way

File handling, C++-way

File handling, C++-way

Call by value and reference